Guide: Installation Testing with cryosparcm test

This guide covers how to use cryosparcm test to verify your CryoSPARC installation is working properly.

The information in this section applies to CryoSPARC v4.0+.

Overview

After installing CryoSPARC, you can verify that your instance is correctly installed by running cryosparcm test install and cryosparcm test workers from the command line. These commands perform several tests that ensure users can seamlessly launch jobs and process data in CryoSPARC.

cryosparcm test install

This function tests the core components of CryoSPARC (HTTP connections, licensing, workers, etc.) that are required to start running jobs and provides information on the status of the CryoSPARC instance (e.g., which version is running, whether a patch is available, etc.).

To run this function, log into a shell on the master node as the user that owns the CryoSPARC instance.

Run cryosparcm test -h for full usage instructions.
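
For example, to run the installation tests from a shell on the master node (the username and hostname below are placeholders; cryosparcm test i is accepted as a short form, as seen in the example output that follows):

ssh cryosparcuser@uoft
cryosparcm test install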

Example Output

cryosparcuser@uoft ~/ $ cryosparcm test i
Running installation tests...
✓ Running as cryoSPARC owner
✓ Running on master node
✓ CryoSPARC is running
✓ Connected to command_core at http://uoft:63582
✓ CRYOSPARC_LICENSE_ID environment variable is set
✓ License has correct format
✓ Insecure mode is disabled
✓ License server set to "https://get.cryosparc.com"
✓ Connection to license server succeeded
✓ License server returned success status code 200
✓ License server returned valid JSON response
✓ License exists and is valid
✓ CryoSPARC is running v4.0.0
✓ Running the latest version of cryoSPARC
✓ Patch update not required
✓ Admin user has been created
✓ GPU worker connected

Test Checklist

Running cryosparcm test install will test or check the following components:

  1. Test if the cryosparcm test install command is running as the user who owns the CryoSPARC instance.

  2. Test if the cryosparcm test install command is running on the machine that runs the CryoSPARC master instance.

  3. Check if the CryoSPARC instance is turned on.

    • If this test fails, turn on CryoSPARC by running cryosparcm start and run the command again (see also the manual-check sketch after this checklist).

  4. Test if an HTTP connection can be successfully created to the command_core (CRYOSPARC_BASE_PORT+2) server.

    • If this test fails, ensure a firewall isn’t blocking access to the consecutive ports from CRYOSPARC_BASE_PORT through CRYOSPARC_BASE_PORT+10 (default 39000–39010). For more information, see Open TCP Ports in the Guide.

  5. Check if the environment variable CRYOSPARC_LICENSE_ID is set.

  6. Test if the CryoSPARC License ID is in the correct format.

    1. If this test fails, ensure the CryoSPARC License ID found in cryosparc_master/config.sh is set to the correct license ID.

  7. Check if insecure request mode is enabled or disabled.

    1. This option is controlled by the CRYOSPARC_INSECURE environment variable found in cryosparc_master/config.sh.

    2. Enabling this option ignores SSL certificate errors when connecting to HTTPS endpoints. This is useful if you are behind an enterprise network using SSL injection.

  8. Check if the URL to the license server is valid.

    1. The URL can be overridden by the CRYOSPARC_LICENSE_SERVER_ADDR environment variable found in cryosparc_master/config.sh.

    2. The default URL is https://get.cryosparc.com.

  9. Check if the CryoSPARC License ID being used is active.

    1. If this test fails, either the CryoSPARC instance wasn’t able to connect to the licensing server, the license isn’t active, or there was a network partition causing data corruption (in which case, trying the command again in a few minutes may fix the issue).

    2. If the instance is having trouble connecting to the licensing server, see License Server Troubleshooting in the Guide. Additionally, if your network is behind an HTTP proxy, see Custom SSL Certificate Authority Bundle in the Guide.

    3. If the license being used is no longer active, request a new CryoSPARC License ID. See Obtaining A License ID in the Guide.

  10. Check the current running version of the CryoSPARC instance.

  11. Check if there is an update available for CryoSPARC.

    1. To update CryoSPARC, run cryosparcm update. For more information, see Software Updates and Patches in the Guide.

  12. Check if there is a patch update available for CryoSPARC.

    1. To patch CryoSPARC, run cryosparcm patch. For more information, see Apply Patches in the Guide.

  13. Check if a worker is connected with at least one GPU.

    1. To add a GPU worker to CryoSPARC, see Connecting A Worker Node in the Guide.
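
Several of these checks can also be performed manually from a shell on the master node. A minimal sketch of such checks is below; the config path and port are assumptions that match a default installation with CRYOSPARC_BASE_PORT=39000:

# 3. Is the instance running? If not, start it and re-run the test:
cryosparcm status
cryosparcm start

# 4. Can command_core (CRYOSPARC_BASE_PORT + 2) be reached over HTTP?
curl http://localhost:39002

# 5/6. Is CRYOSPARC_LICENSE_ID set in the master config?
grep CRYOSPARC_LICENSE_ID /path/to/cryosparc_master/config.sh

# 8/9. Can the license server be reached from the master node?
curl -I https://get.cryosparc.com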

cryosparcm test workers

This function tests the workers connected to CryoSPARC to ensure they can correctly run CryoSPARC jobs: it checks whether each worker can launch jobs, cache particles to an SSD (if an SSD is configured), and utilize its GPUs correctly. The tests can be run via the command line or directly in the CryoSPARC user interface, and are implemented as three jobs that can be run at any time on the lane you’d like to test.

Usage

Run cryosparcm test -h for full usage instructions.

The tests must be run inside a project. If there are no projects in the instance, create one before running this function.

To run all tests on all workers:

  • run cryosparcm test workers <project_uid>

  • (e.g., cryosparcm test workers P1)

To run only the GPU test on all workers:

  • run cryosparcm test workers <project_uid> --test gpu

  • (e.g., cryosparcm test workers P1 --test gpu).

To run only the GPU test on a single worker:

  • run cryosparcm test workers <project_uid> --test gpu --target <workstation_hostname>

  • (e.g., cryosparcm test workers P1 --test gpu --target cryoem1.uoft.ca)

To run only the GPU test (with Tensorflow and PyTorch) on a single worker:

  • run cryosparcm test workers <project_uid> --test gpu --test-tensorflow --test-pytorch --target <workstation_hostname>

  • (e.g., cryosparcm test workers P1 --test gpu --test-tensorflow --test-pytorch --target cryoem1.uoft.ca)

To run only the GPU test on two workers:

  • run cryosparcm test workers <project_uid> --test gpu --target <workstation1_hostname> --target <workstation2_hostname>

  • (e.g., cryosparcm test workers P1 --test gpu --target cryoem1.uoft.ca --target cryoem2.uoft.ca)

Example Output

Some text removed for readability.

cryosparcuser@uoft ~/ $ cryosparcm test workers P1
Using project P1
Running worker tests...
Worker test results
cryoem3
  ✓ LAUNCH
  ✓ SSD
  ✓ GPU
cryoem2
  ✓ LAUNCH
  ✓ SSD
  ✓ GPU
cryoem5
  ✓ LAUNCH
  ✕ SSD
    Error: [Errno 13] Permission denied: '/scratch'
    See P1 J1211 for more information
  ⚠ GPU
    No GPU available
cryoem6
  ✕ LAUNCH
    Error: 
    See P1 J1203 for more information
  ⚠ SSD
    Did not run: Launch test failed
  ⚠ GPU
    Did not run: Launch test failed
cryoem1
  ✓ LAUNCH
  ✓ SSD
  ✓ GPU
    ⚠ RTX A6000 @ 00000000:03:00.0: Persistence Mode is Disabled. 
      Enable Persistence mode by running `nvidia-smi -pm 1` as root to persist 
      the NVIDIA driver, reducing GPU load times.
    ⚠ RTX A6000 @ 00000000:03:00.0: GPU Software Power Cap is Active
    ⚠ RTX A6000 @ 00000000:21:00.0: Persistence Mode is Disabled. 
      Enable Persistence mode by running `nvidia-smi -pm 1` as root to persist 
      the NVIDIA driver, reducing GPU load times.
    ⚠ RTX A6000 @ 00000000:21:00.0: GPU Software Power Cap is Active
    ⚠ GeForce RTX 3090 @ 00000000:4C:00.0: Persistence Mode is Disabled. 
      Enable Persistence mode by running `nvidia-smi -pm 1` as root to persist 
      the NVIDIA driver, reducing GPU load times.
cryoem7
  ✓ LAUNCH
  ✓ SSD
  ✓ GPU
cryoem9
  ✓ LAUNCH
  ✓ SSD
  ✕ GPU
    Error: Tensorflow detected 0 of 7 GPUs.
    See P1 J1222 for more information
cryoem10
  ✓ LAUNCH
  ✓ SSD
  ✓ GPU

When the worker test is run, a new workspace inside the specified project will be created to contain all test jobs. The workspace will be named with the date and time (UTC) of execution.

If a test job fails, check the job's Event Log and stdout log (joblog) for more details.
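
For example, a failed test job's stdout log can be streamed from the master node with cryosparcm joblog; the project and job IDs below are taken from the failure examples later in this section:

cryosparcm joblog P1 J1203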

Launch Test

The ability to launch jobs will be tested first. This will indicate if the worker is accessible and can correctly run CryoSPARC jobs. If this test fails, it most likely indicates a connection issue between the master and the worker. For more information, see Cannot Queue or Run Job in the Guide.

Note that if a launch test fails on a worker, the SSD and GPU tests will not run:

cryoem6
  ✕ LAUNCH
    Error: ssh: connect to host cryoem6 port 22: No route to host
    See P1 J1203 for more information
  ⚠ SSD
    Did not run: Launch test failed
  ⚠ GPU
    Did not run: Launch test failed
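
Since the example above shows an SSH failure, a quick manual check is to confirm that the master node can reach the worker over SSH without a password prompt; the hostname and username below are taken from the examples in this guide:

ping -c 3 cryoem6
ssh cryosparcuser@cryoem6 hostname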

SSD Test

If an SSD is configured for a worker, the SSD test will confirm that particle caching is working properly. The test creates five different particle stacks of shape (500, 512, 512) in the project directory, and tries to cache them to the SSD.

Testing SSD

Generating a 500 particle stack with shape (512, 512).

Writing particle stack 1/5... Done in 1.517s
Writing particle stack 2/5... Done in 1.329s.
Writing particle stack 3/5... Done in 1.290s.
Writing particle stack 4/5... Done in 1.221s.
Writing particle stack 5/5... Done in 1.219s.

Loading a ParticleStack with 5 items...
 SSD cache : cache successfully synced in_use
 SSD cache : cache successfully synced, found 233127.92MB of files on SSD.
 SSD cache : cache successfully requested to check 5 files.
 SSD cache : cache requires 2500.00MB more on the SSD for files to be downloaded.
 SSD cache : cache has enough available space.

 Transferring J33/data/simulated_particles_4.mrc (500 MB) (5/5)
  Complete      :         2500 MB (100.00%)
  Total         :         2500 MB
  Current Speed :    1133.22 MB/s
  Average Speed :    1089.09 MB/s
  ETA           :      0h  0m  0s

 SSD cache : complete, all requested files are available on SSD.
Done.

Cleaning up testing data...
SSD Test completed successfully.

If an SSD Test fails for any reason, the reason will be summarized in the test results:

cryoem5
  ✓ LAUNCH
  ✕ SSD
    Error: [Errno 13] Permission denied: '/scratch'
    See P1 J1211 for more information
  ⚠ GPU
    No GPU available
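
For a permission error like the one above, a reasonable first step is to check, on the worker, that the cache path exists and is writable by the CryoSPARC user; the path is taken from the example and the username is an assumption:

ls -ld /scratch
# Adjust ownership or permissions as appropriate for your site, e.g.:
sudo chown cryosparcuser: /scratch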

For more information on configuring and troubleshooting an SSD cache for a worker, see SSD Particle Caching in CryoSPARC in the Guide.

GPU Test

The GPU test will collect information about all the GPUs on the worker and test if the worker can compile and run GPU code.

The following information is collected about each GPU via nvidia-smi:

  • driver_version: GPU driver version

  • persistence_mode: GPU driver persistence

  • power_limit: GPU power limit (TDP)

    • information only

  • sw_power_limit: software power limiter

    • if “Active”, this might indicate that the power supply unit (PSU) on the workstation isn’t able to support the power draw from the GPU, or that a power supply cable is faulty or not properly connected to the GPU

    • if “Active”, this might indicate the GPU temperature is too high

  • hw_power_limit: hardware power limiter

    • if “Active”, this might indicate the power supply unit (PSU) on the workstation isn’t able to support the power draw from the GPU

    • if “Active”, this might indicate the GPU temperature is too high

  • compute_mode: current compute mode (Default, Exclusive Process, etc.)

    • the “default” compute mode allows users to launch multiple GPU jobs onto the same GPU via the Queue modal in the UI. See Queuing Directly To A GPU in the Guide.

    • the “exclusive process” compute mode prevents a process from obtaining a context from a GPU while another process already has one, useful in anonymous multi-user scenarios

    • set the compute mode of a GPU by running nvidia-smi -c compute_mode -i target_gpu_id as root (see also the nvidia-smi sketch after this list), where compute_mode is one of:

      • 0/Default, 1/Exclusive Thread, 2/Prohibited, 3/Exclusive Process

  • max_pcie_link_gen: maximum PCIe link generation (e.g., PCIe 3 or PCIe 4)

    • information only

  • current_pcie_link_gen: current PCIe link generation

    • information only

    • this may be equal to or lower than max_pcie_link_gen, as the GPU drops to a lower link generation when idle and automatically switches to a higher one under load

  • temperature: current temperature

    • information only

  • gpu_utilization: current utilization

    • information only

  • memory_utilization: current memory utilization

    • information only
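
The same fields can be queried directly with nvidia-smi on the worker, and the persistence and compute modes mentioned above can be changed as root. A minimal sketch (the GPU index 0 is an example):

# Query a subset of the fields collected by the GPU test:
nvidia-smi -i 0 --format=csv --query-gpu=driver_version,persistence_mode,power.limit,compute_mode,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory

# Enable persistence mode (as suggested in the worker test warnings):
sudo nvidia-smi -pm 1

# Set GPU 0 to Exclusive Process compute mode (3), or back to Default (0):
sudo nvidia-smi -c 3 -i 0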

Example data:

Obtaining GPU info via `nvidia-smi`...

NVIDIA GeForce RTX 3090 @ 00000000:01:00.0
    driver_version                :510.68.02
    persistence_mode              :Enabled
    power_limit                   :350.00
    sw_power_limit                :Not Active
    hw_power_limit                :Not Active
    compute_mode                  :Default
    max_pcie_link_gen             :4
    current_pcie_link_gen         :1
    temperature                   :25
    gpu_utilization               :0
    memory_utilization            :0

NVIDIA A100-PCIE-40GB @ 00000000:61:00.0
    driver_version                :510.68.02
    persistence_mode              :Enabled
    power_limit                   :250.00
    sw_power_limit                :Not Active
    hw_power_limit                :Not Active
    compute_mode                  :Default
    max_pcie_link_gen             :4
    current_pcie_link_gen         :4
    temperature                   :33
    gpu_utilization               :0
    memory_utilization            :0

Starting PyCuda GPU test on: NVIDIA A100-PCIE-40GB @ 0000:61:00.0
    PyCuda was compiled with CUDA: (11, 2, 0)
Finished PyCuda GPU test in 0.026s

Testing Tensorflow...
    Tensorflow found 4 GPUs.
Tensorflow test completed in 3.385s

Finally, PyCUDA (and optionally Tensorflow and PyTorch) will be tested to ensure they are working properly. If either of these tests fails, the error will be summarized in the test results. For more information, check the failed job’s Event Log and stdout log (joblog).

cryoem9
  ✓ LAUNCH
  ✓ SSD
  ✕ GPU
    Error: Tensorflow detected 0 of 7 GPUs.
    See P1 J1222 for more information

Testing Tensorflow and PyTorch

By default, Tensorflow and PyTorch capabilities are not tested during the GPU test. To enable these tests, specify --test-tensorflow and/or --test-pytorch when starting the worker test. For example:

cryosparcm test workers P12 --test-tensorflow --target cryoem9.structura.dev

The PyTorch test will fail if the 3D Flex Refine dependencies were not installed using cryosparcw install-3dflex (introduced in CryoSPARC v4.1.0). For more information, see 3D Flex Refine: Installing Dependencies in the Guide.
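
If this is the case, the dependencies can be installed on the worker node. A minimal sketch, assuming a default worker installation layout (replace the path with your own):

cd /path/to/cryosparc_worker
./bin/cryosparcw install-3dflex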

If Tensorflow or PyTorch was not able to detect all GPUs on your system, the job will fail, and the error message will appear in the job's stdout log (found in the 'Metadata' tab of the Job Dialog).
