Guide: Cluster Integration Validation

How to validate the configuration of a cluster lane and monitor job status.

The information in this section applies to CryoSPARC v4.0+.

CryoSPARC v4.0 includes two improvements to cluster integration:

  • Cluster integrations can be tested and validated after being connected to CryoSPARC by running cryosparcm cluster validate.

  • For jobs submitted to cluster lanes, the cluster scheduler is polled at a regular interval until the job enters running status. This prevents jobs that failed to submit or launch in the cluster scheduler from appearing indefinitely in CryoSPARC in launched status.

For more information on how to connect a cluster scheduler to CryoSPARC, see Connecting a Cluster to CryoSPARC.

Cluster Setup Testing

The cryosparcm cluster validate command validates the configuration of a cluster lane by using the configured submit, status, and delete commands (qsub_cmd_tpl, qstat_cmd_tpl, qdel_cmd_tpl) to submit, query, and cancel test jobs on the cluster lane.

Usage: cryosparcm cluster validate <cluster_lane_name> --projects_dir <abs_path_to_projects_dir>

where

  • <cluster_lane_name> is the name of the cluster lane configured in CryoSPARC that will be tested

  • <abs_path_to_projects_dir> is the absolute path to the directory where the validation tests will be run. It is useful to provide the path to a directory where CryoSPARC projects will be stored on disk, as this directory is tested for readability/writability from cluster nodes and will be the output location for files created by the validation tests. The directory does not need to be empty.
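As an illustration of the usage above, the following sketch runs the validation for one lane. The lane name and directory are placeholders, and the check_projects_dir helper is a hypothetical pre-check added for this example; it is not part of cryosparcm. It only confirms, on the machine where it runs, that the path is absolute and points to a writable directory, which does not replace the readability/writability test performed from the cluster nodes.

```shell
# Placeholder values: replace with your own lane name and projects directory.
LANE_NAME="slurm-lane"
PROJECTS_DIR="/data/cryosparc_projects"

# Hypothetical helper (not part of cryosparcm): succeeds only if the argument
# is an absolute path to an existing directory that is writable locally.
check_projects_dir() {
  case "$1" in
    /*) ;;
    *) echo "not an absolute path: $1" >&2; return 1 ;;
  esac
  [ -d "$1" ] && [ -w "$1" ] || { echo "not a writable directory: $1" >&2; return 1; }
}

# Run the validation only if the pre-check passes and cryosparcm is on PATH.
if check_projects_dir "$PROJECTS_DIR" && command -v cryosparcm >/dev/null 2>&1; then
  cryosparcm cluster validate "$LANE_NAME" --projects_dir "$PROJECTS_DIR"
fi
```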

The output of the jobs submitted during the test is written to test_cluster_qsub.txt and test_cluster_qdel.txt in the projects_dir directory. Check both files manually for correctness once cryosparcm cluster validate finishes.

Example test_cluster_qsub.txt test output:

Process started 2022-08-25 21:08:58.999810
NODE : slurmnode1
PID  : 536
Executing for 5 secs
Process finished 2022-08-25 21:09:04.005305

Validation success!

Example test_cluster_qdel.txt test output for a Slurm cluster:

slurmstepd-slurmnode1: error: *** JOB 4 ON slurmnode1 CANCELLED AT 2022-08-25T21:09:03 ***
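As a quick first pass before reading the files by hand, a small check like the one below can confirm that the submitted test job ran to completion. This is a sketch that assumes test_cluster_qsub.txt follows the sample format shown above ("Process started" and "Process finished" lines); the function name is hypothetical and is not part of cryosparcm. The contents of test_cluster_qdel.txt depend on the scheduler, so that file is not checked here.

```shell
# Hypothetical helper (not part of cryosparcm): succeeds if the given file is
# readable and contains both a "Process started" and a "Process finished" line,
# matching the sample test_cluster_qsub.txt output.
qsub_output_looks_complete() {
  [ -r "$1" ] || return 1
  grep -q '^Process started ' "$1" && grep -q '^Process finished ' "$1"
}

# Example call (the path is a placeholder):
# qsub_output_looks_complete /data/cryosparc_projects/test_cluster_qsub.txt \
#   && echo "qsub test output looks complete"
```

A passing check only shows that the job started and finished; the node name and timestamps still need to be reviewed manually.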

Cluster Job Status Monitoring

CryoSPARC checks the status of jobs submitted to a cluster lane at a regular interval, using the qstat_cmd_tpl from the cluster configuration, until the CryoSPARC job status is updated to running or the job fails to launch with an error. The current status appears in the CryoSPARC job’s event log as the job is launched.
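The general pattern of polling a scheduler until a job is running can be sketched as follows. This is an illustration only, not CryoSPARC's implementation: the function name, the "RUNNING" state string, and the return codes are assumptions, and the status command stands in for a rendered qstat_cmd_tpl. What CryoSPARC does when the retries run out is not described here, so the sketch simply returns a non-zero code.

```shell
# Hypothetical sketch of interval polling (not CryoSPARC's actual code).
# $1: command that prints the job's scheduler state (stand-in for qstat_cmd_tpl)
# $2: seconds to wait between checks
# $3: maximum number of checks
poll_until_running() {
  status_cmd="$1"; interval="$2"; max_retries="$3"
  attempt=0
  while [ "$attempt" -lt "$max_retries" ]; do
    # A failing status query is treated as a launch error in this sketch.
    state=$($status_cmd 2>/dev/null) || return 2
    # "RUNNING" is an assumed state string; real schedulers differ.
    [ "$state" = "RUNNING" ] && return 0
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1   # job never reached the running state within max_retries checks
}
```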

Example job event log entry for a Slurm cluster job status update:

The interval between status updates and the maximum number of retries can be configured using the CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL and CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES variables in config.sh. By default, cluster jobs are checked every 10 seconds, with up to 1000000 retries.
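For example, the two variables could be set in config.sh as shown below. The values are illustrative rather than recommendations, and the sketch assumes that config.sh uses the usual export VARIABLE=value form and that the interval is given in seconds, consistent with the 10-second default.

```shell
# Illustrative values only; the defaults are 10 (seconds) and 1000000.
# Check the cluster scheduler every 30 seconds instead of every 10.
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=30
# Stop checking after 1000 attempts.
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000
```

Changes to config.sh typically take effect only after CryoSPARC is restarted.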
