Job Manager for remote execution of QuantumATK scripts

In this tutorial you will learn how to use the Job Manager for execution of QuantumATK jobs on remote computing clusters. In particular, you will learn how to:

  • add a remote machine to the Computers/Jobs;

  • use custom Machine Settings for individual jobs;

  • add several different machines and import/export machine settings.

Important

You will set up a remote machine for running jobs in parallel using MPI, as well as with threading. We strongly recommend you go through the tutorial Job Manager for local execution of QuantumATK scripts before continuing with this one.

Note

There are two essential requirements when using the Job Manager for executing and managing QuantumATK jobs on a remote cluster. You need:

  1. QuantumATK installed on the local machine and on the cluster;

  2. an SSH connection from your local machine to the cluster.

Please refer to the tutorial SSH keys if you need help setting up the SSH connection.
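Before configuring the Job Manager, you may want to confirm that a non-interactive SSH login to the cluster works. The following is a minimal sketch, not part of QuantumATK: the function names, the user name, and the host name are hypothetical, and it assumes the standard OpenSSH ssh client is installed on your local machine. Because it uses BatchMode, it only succeeds for key-based authentication and never prompts for a password.

```python
import subprocess


def build_ssh_check_command(user, host, key_path=None, timeout_s=10):
    """Return an ssh command that runs `true` remotely without prompting."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout_s}"]
    if key_path:
        # Use an explicit private key instead of the ssh defaults.
        cmd += ["-i", key_path]
    cmd += [f"{user}@{host}", "true"]
    return cmd


def ssh_is_reachable(user, host, key_path=None, timeout_s=10):
    """Return True if the non-interactive login exits with code 0."""
    cmd = build_ssh_check_command(user, host, key_path, timeout_s)
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s + 5)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Hypothetical user and host names, for illustration only.
print(" ".join(build_ssh_check_command("myuser", "cluster.example.org")))
```

Calling ssh_is_reachable("myuser", "cluster.example.org") with your own user and host names runs the actual connection test.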


A single remote machine

Click the Create computer icon to add a new machine.

snap25

The menu that appears has seven options: New DirectRemote, New Local, New LSF, New PBS, New PBSPro, New SGE and New SLURM.

Choose the type of remote machine that matches the job scheduler on your remote Linux cluster, and start setting up the connection. Here, you will choose New LSF to set up a machine with an LSF job scheduling system. The option New DirectRemote is appropriate for clusters that have no queue system.

The Computer Settings widget pops up. It has six main tabs:

Computer

Connection to the remote cluster.

snap27

This tab is used to provide the credentials for connecting to the remote cluster of choice. It supports password-based and key-based authentication.

Private key file

The path to your private SSH key. You might already have one; if not, you can generate one by following the tutorial SSH keys.

Once all settings are added, you can check if they are correct by clicking the Check connection button.

Environment

Computing environment on the cluster.

snap28b

This tab concerns the environment (directories, executables, modules, etc.) on the remote cluster, not on your local computer. $HOME is therefore the home directory of your user on the cluster, and QuantumATK must of course be installed on the cluster.

Any scripts that should be sourced in order to get the environment working must be listed. The same goes for required export statements (may be needed to correctly set the SNPSLMD_LICENSE_FILE) and cluster modules that should be loaded using the module load command (https://modules.readthedocs.io).
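One way to confirm that the sourced scripts, export statements, and loaded modules took effect is to print the relevant environment from a short test job. The sketch below is an illustration under assumptions: the function name is hypothetical, and it assumes that your module system records loaded modules in the colon-separated LOADEDMODULES variable, as Environment Modules does.

```python
import os


def environment_summary(env=None):
    """Collect the license variable and the list of loaded modules."""
    env = os.environ if env is None else env
    # Assumption: loaded modules are listed colon-separated in LOADEDMODULES.
    modules = [m for m in env.get("LOADEDMODULES", "").split(":") if m]
    return {
        "license_file": env.get("SNPSLMD_LICENSE_FILE"),  # None if unset
        "loaded_modules": modules,
    }


summary = environment_summary()
print("SNPSLMD_LICENSE_FILE:", summary["license_file"] or "not set")
print("Loaded modules:", ", ".join(summary["loaded_modules"]) or "none reported")
```

If the license variable is reported as not set in the job log, the corresponding export statement is probably missing from the Environment tab.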

You can also provide custom submission commands. For example, you can choose a particular queue to submit the job:

# Submit job to "myqueue" queue
BSUB -app myqueue

Queuing system

Queuing system settings and resources available for job submission.

../../../_images/07_queuing_system.png

The nodes available and cores per node values are used to determine the default job settings.

LSF path

This is the directory containing bsub and the other LSF executables. It is not needed if Settings File is provided. Log on to the cluster and use the which command to locate it:

$ which bsub
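The same lookup can be done from Python, which is convenient if you want the containing directory rather than the full path to the executable. This is a small sketch with a hypothetical function name; it must be run on the cluster, in an environment where the LSF commands are on the PATH.

```python
import os
import shutil


def scheduler_directory(executable="bsub"):
    """Return the directory holding the executable, or None if not on PATH."""
    path = shutil.which(executable)
    return os.path.dirname(path) if path else None


print(scheduler_directory() or "bsub not found on PATH")
```

The printed directory is the value to enter in the LSF path field.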

Job Settings

Computing resources requested at job submission.

../../../_images/08_job_settings.png

This tab specifies the default computing resources (nodes, cores, queue, time, etc.) requested at job submission. Once the number of cores to use on each node is specified, you can decide how to split them between MPI and threading parallelization. While changing the number of MPI processes, the number of threads per process is automatically adjusted.
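The relation between cores, MPI processes, and threads can be illustrated with a few lines of Python. This is a sketch under an assumption: it uses integer division of the cores on a node by the MPI processes on that node, which may differ from the exact rule the Job Manager applies. The function name is hypothetical.

```python
def threads_per_process(cores_per_node, mpi_processes_per_node):
    """Threads each MPI process can use without oversubscribing the node."""
    if cores_per_node < 1 or mpi_processes_per_node < 1:
        raise ValueError("core and process counts must be positive")
    if mpi_processes_per_node > cores_per_node:
        raise ValueError("more MPI processes than cores on the node")
    # Assumed rule: leftover cores (if any) stay idle.
    return cores_per_node // mpi_processes_per_node


for n_mpi in (1, 2, 4, 8, 48):
    n_threads = threads_per_process(48, n_mpi)
    print(f"{n_mpi:2d} MPI processes x {n_threads:2d} threads on a 48-core node")
```

For example, 4 MPI processes on a 48-core node leave 12 threads per process, while 48 MPI processes leave a single thread each.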

The Job Manager will regularly check the job progress on the remote cluster and report it in the log file. You can also receive e-mail notifications from the job scheduler when the job starts, finishes, or terminates. Ask your cluster administrator whether e-mail notifications are supported on your cluster.

Diagnostics

Use this tab to test the computer settings. A green check mark indicates that all settings appear to be OK. A red cross indicates a problem that should be fixed in one of the tabs.

snap28

Warning

The Diagnostics tab checks if the options in the Computer and Environment tabs allow your local computer to connect to the remote cluster and execute the commands needed for job submission and management.

If some field is not specified, the diagnostics will check the default setting. If the connection to the cluster works well with that default, or is at least not disrupted, it will be marked as OK.

You therefore need to run an actual test job to make absolutely sure that all settings are indeed OK.

Example script

This tab shows an example of a job submission script. You can use it, for example, to check that the options passed to the job scheduler are correct.

../../../_images/09_example_script.png

Save and test the new machine

Click OK to add the machine to the Computers/Jobs.

../../../_images/10_sidebar.png

Next, run the test scripts mpi_check.py and test_mpi.py to test the machine settings. In the QuantumATK main window, click Add new job, choose the script in the file dialog box, and click Open.

../../../_images/14_new_job.png

This will open the Computer Settings dialog box. Choose the remote machine, in this example MyLSFMachine, and click Submit.

../../../_images/15_job_submission.png

Once a job has finished, inspect its output to check that the expected number of processes was used. Click the Log tab to open the log. In the two examples below, 48 cores were used in total on a single node.

../../../_images/17_log_1.png

snap29
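The contents of mpi_check.py and test_mpi.py are not reproduced here. As a rough illustration of what such a check reports, the sketch below prints the rank of each process and the total process count. It is an assumption-laden example: it relies on the mpi4py package, which may or may not be available in the Python environment on your cluster, and falls back to a single process if the import fails. Reading OMP_NUM_THREADS is likewise an assumption about how the thread count is exposed.

```python
import os


def parallel_layout():
    """Return (rank, size) from MPI, or (0, 1) when mpi4py is unavailable."""
    try:
        from mpi4py import MPI  # assumption: mpi4py is installed
    except ImportError:
        return 0, 1
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


rank, size = parallel_layout()
# Assumption: the thread count, if set, is exposed through OMP_NUM_THREADS.
threads = os.environ.get("OMP_NUM_THREADS", "unset")
print(f"process {rank} of {size}, OMP_NUM_THREADS={threads}")
```

When launched under MPI, each process prints one line, so the number of lines in the log should match the number of MPI processes you requested.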

Custom job settings

As explained in the tutorial Job Manager for local execution of QuantumATK scripts, you can customize many of the job settings before submitting a job. Use the script silicon.py as an example QuantumATK script. Download the script, click Add new job, and open the script from the file dialog box. This will open a Submission Settings dialog box.

You can now customize the Job Defaults, Job Array, Notifications and File I/O tabs, and thereby submit the job with settings different from the default ones you specified above. For example, you can change the number of requested cores, the cluster queue, and the maximum wall-clock time.

Set up the job settings as you like, e.g. 4 cores on a single node and 4 MPI processes. Then click OK and submit the calculation to the remote cluster.

Debugging

If an error occurs during the execution of the job, this is indicated by a red icon in the queue, as shown below. You can then click the Debug logs icon to open the job debug information window, which shows details about the error.

../../../_images/22_job_error.png ../../../_images/23_error_bug.png

Adding several remote machines

Several remote machines can be added to the Computers/Jobs. You can of course add new machines from scratch, but you can also export/import the settings of an existing machine and use those as a template for new machines.

Tip

The import/export functionality is very convenient for sharing machine settings within a group of users.

In the following, you will export the settings of the newly created machine, and then add one more remote machine with different queue and job settings.

Export

Select the machine and click Import or export computers, then click Export and save the settings to a file.

../../../_images/19_export_computers.png

Import

Next, add a new remote machine with default settings for “New LSF”. Click Import or export computers, then click Import and import the settings file you just saved.

../../../_images/20_import_computers.png

You can now modify the imported settings to create a machine with different queue and job defaults:

Select the new machine and click Edit. This will open the Computer Settings dialog box. Give the new machine a reasonable name, e.g. “MyNewLSFMachine”, and set the password if Password is selected for authentication. Then change the job submission queue to “mynewqueue” by adding BSUB -app mynewqueue to Custom submission commands in the Environment tab. Additionally, you can change Nodes available, e.g. to 2, and Cores per node, e.g. to 8, in the Queuing system tab. Finally, click Run diagnostics in the Diagnostics tab to check the settings, and click OK if all checks pass.