Job Manager for remote execution of QuantumATK scripts¶
Version: 2017.0
In this tutorial you will learn how to use the Job Manager for execution of QuantumATK jobs on remote computing clusters. In particular, you will learn how to:
- add a remote machine to the Machine Manager;
- use custom Machine Settings for individual jobs;
- add several different machines and import/export machine settings.
Important
You will set up a remote machine for running jobs in parallel using MPI, as well as with threading. We strongly recommend you go through the tutorial Job Manager for local execution of QuantumATK scripts before continuing with this one.
Since ATK 2017, Intel’s mpiexec.hydra is provided with both the Windows and Linux versions; this is the recommended way to run QuantumATK in parallel. The mpiexec.hydra binary shipped with QuantumATK is located in the folder libexec/mpiexec.hydra in your installation folder.
Note
There are two essential requirements when using the Job Manager for executing and managing QuantumATK jobs on a remote cluster. You need:
- QuantumATK installed on the local machine and on the cluster;
- an SSH connection from your local machine to the cluster.
Please refer to the tutorial SSH keys if you need help setting up the SSH connection.
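Before configuring the Machine Manager, it can help to confirm that key-based SSH login works without a password prompt. The following is a minimal sketch, not part of QuantumATK: the function names, the hostname `cluster.example.com`, and the username `myuser` are hypothetical placeholders, and it assumes an OpenSSH client (`ssh`) is available on your local machine.

```python
import subprocess


def build_ssh_check_command(hostname, username, key_path=None):
    """Build a non-interactive ssh command that only runs `true` on the remote host."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10"]
    if key_path is not None:
        cmd += ["-i", key_path]  # explicit private key file
    cmd += [f"{username}@{hostname}", "true"]
    return cmd


def ssh_connection_works(hostname, username, key_path=None):
    """Return True if the ssh command exits with status 0."""
    try:
        result = subprocess.run(
            build_ssh_check_command(hostname, username, key_path),
            capture_output=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection attempt hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical placeholders: replace with your own cluster and user name.
    print(" ".join(build_ssh_check_command("cluster.example.com", "myuser")))
```

If `ssh_connection_works(...)` returns `False` for your real hostname and username, the SSH key setup probably needs attention before the Job Manager can connect.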
A single remote machine¶
Open the Machine Manager, and click New in order to add a new machine.
Note
Please note that the Machine Manager options may differ between versions of QuantumATK.
The menu that appears has six options: Local, Remote PBS, Remote LSF, Remote SLURM, Remote Direct, and ATK On-Demand.
Choose the type of remote machine that matches the job scheduling on your remote Linux cluster, and start setting up the connection. For more information about the QuantumATK On-Demand option click here. Here, you will choose the Remote PBS machine to set up a machine with a PBS job scheduling system. The option Remote Direct is appropriate for clusters that have no queue system.
The Machine Settings widget pops up. It has five main tabs:
- Settings (remote connection),
- Environment (software on the remote cluster),
- Resources (allocated computing resources and time),
- Notifications (job progress updates),
- Diagnostics (check the current setting).
Hint
Most options have default values (which must be checked). Red fields are mandatory, but have no default.
Settings¶
Connection to the remote cluster.
You need to specify the following fields:
- Machine name
- Hostname
- Username
- Use ssh key
The following options vary from cluster to cluster.
- Private key path: The directory containing your private SSH key.
- Node type names: Optional, but in this example you have access to node type “orange”.
- Queue names: Optional, could for example be “long”.
- Path to PBS binaries: The directory containing qsub and the other PBS executables must be specified. Log on to the cluster and use the which command to locate it (e.g. /usr/local/torque-4.2.8/bin):

    $ which qsub
    /usr/local/torque-4.2.8/bin/qsub
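The same lookup can be done from Python on the cluster. This is a small sketch that assumes `qsub` is on your `PATH` there; the helper names `pbs_binary_dir` and `find_pbs_binary_dir` are illustrative and not part of QuantumATK.

```python
import posixpath
import shutil


def pbs_binary_dir(qsub_path):
    """Return the directory part of a full path to the qsub executable."""
    return posixpath.dirname(qsub_path)


def find_pbs_binary_dir():
    """Locate qsub on PATH (like `which qsub`) and return its directory, or None."""
    qsub_path = shutil.which("qsub")
    if qsub_path is None:
        return None
    return pbs_binary_dir(qsub_path)


if __name__ == "__main__":
    # Prints None when qsub is not on PATH (for example on your local machine).
    print(find_pbs_binary_dir())
```

For the path from the example above, `pbs_binary_dir("/usr/local/torque-4.2.8/bin/qsub")` gives `/usr/local/torque-4.2.8/bin`, which is the value to enter in the Path to PBS binaries field.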
Once all settings are added, you can check if they are correct by navigating to the Diagnostics tab and clicking the “Run Diagnostics” button.
Tip
The Diagnostics tab checks if the options in the Settings and Environment tabs allow your local computer to connect to the remote cluster and execute the commands needed for job submission and management.
If some field is not specified, the diagnostics will check the default setting. If the connection to the cluster works well with that default, or is at least not disrupted, it will be marked as OK.
You therefore need to run an actual test job to make absolutely sure that all settings are indeed OK.
Environment¶
Computing environment on the cluster.
This tab concerns the environment (directories, executables, modules, etc.) on the remote cluster, not on your local computer. $HOME is therefore your home directory on the cluster, and QuantumATK must of course be installed on the cluster.
Any scripts that must be sourced to get the environment working must be listed. The same goes for required export statements (which may be needed to set SNPSLMD_LICENSE_FILE correctly) and cluster modules that should be loaded.
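One way to see whether the export statements actually reach a job is to submit a tiny script that reports the relevant variables. The sketch below is an assumption about how such a check could look, not a script shipped with QuantumATK; it inspects only the variable names mentioned in this tutorial, and `report_environment` is a hypothetical helper.

```python
import os

# Variable names taken from this tutorial; extend the tuple as needed.
VARIABLES = ("SNPSLMD_LICENSE_FILE", "OMP_NUM_THREADS", "MKL_DYNAMIC")


def report_environment(env, names=VARIABLES):
    """Return a dict mapping each name to its value, or None if it is not set."""
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    for name, value in report_environment(os.environ).items():
        print(f"{name} = {value if value is not None else '<not set>'}")
```

A variable reported as `<not set>` in the job log suggests that the corresponding export statement is missing from the Environment tab or is not being applied on the compute nodes.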
Resources¶
Computing resources requested at job submission.
This tab specifies the default computing resources (nodes, cores, queue, time, etc.) requested at job submission. Note that MKL_DYNAMIC is disabled by default, which means that MKL is not allowed to dynamically decrease the number of threads at runtime.
Tip
For a job with only MPI parallelization (no threading), the number of nodes times the number of cores per node should equal the (total) number of MPI processes. In the example above, you have 2 x 8 = 16 cores, so you ask for 16 MPI processes and specify that each node should have 8 of those processes (only one MPI process per core).
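The arithmetic in this tip can be written out as a short sketch. The function names are hypothetical, and the `nodes=N:ppn=M` string is shown only as the usual Torque/PBS way of expressing such a request; the Job Manager generates its own submission script, so the exact form it uses may differ.

```python
def mpi_layout(nodes, cores_per_node):
    """Return (total MPI processes, MPI processes per node) for a pure-MPI job.

    Assumes one MPI process per core and no threading.
    """
    if nodes < 1 or cores_per_node < 1:
        raise ValueError("nodes and cores_per_node must be positive integers")
    return nodes * cores_per_node, cores_per_node


def torque_resource_request(nodes, cores_per_node):
    """Format the request in the common Torque/PBS 'nodes=N:ppn=M' style."""
    return f"nodes={nodes}:ppn={cores_per_node}"


if __name__ == "__main__":
    total, per_node = mpi_layout(2, 8)
    print(total, per_node)                # 16 8
    print(torque_resource_request(2, 8))  # nodes=2:ppn=8
```

With 2 nodes of 8 cores each, this reproduces the 16 MPI processes with 8 processes per node used in the example.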
Notifications¶
Job progress reports.
The Job Manager will regularly check the job progress on the remote cluster and report it in the log file. You can also receive e-mail notifications from the PBS scheduler when the job starts, finishes, or terminates.
Diagnostics¶
Use this tab to test the machine settings. A green check mark indicates that all settings appear to be OK. A red circle indicates a problem that should be fixed in one of the other tabs.
Save and test the new machine¶
Click OK to add the machine to the Machine Manager.
Next, you should run the test scripts mpi_check.py and test_mpi.py to test the machine settings. In the QuantumATK main window, drag and drop a script onto the Job Manager. Choose the newly added machine (in this example “Sabalcore”), and click OK.
Click to submit the job and watch how the Task State changes from Pending to Finished. For a longer job, you can click Download log to regularly retrieve the log file during the remote job execution.
Once a job has finished, you should inspect the job output in the log file to see if the expected number of processes was used. Click the icon to open the log. In the two examples below, 16 cores were used in total on two different nodes.
The job outputs have of course also appeared on the LabFloor:
Custom job settings¶
As explained in the tutorial Job Manager for local execution of QuantumATK scripts, you can customize many of the job settings before submitting a job. Use the script silicon.py as an example QuantumATK script. Download the script, send it to the Job Manager, and choose the newly created remote machine. Then click the Job Settings plug-in.
You can now customize the Resources and Notifications tabs, and thereby submit the job with settings different from the default ones you specified above. For example, you can change the number of requested cores and cluster queue and/or maximum wall-clock time. You can also change the notifications settings.
Set up the job settings as you like, e.g. 4 cores on a single node and 4 MPI processes. Then click OK and submit the calculation to the remote cluster.
The job log is automatically retrieved when the job finishes. For long jobs, however, you need to click “Download log” if you want to see the log while the job is running.
Debugging¶
If an error occurs during the execution of the job, this will be indicated by a red square in the queue, as shown below. You can then click the Debug logs icon to open the job debug information window, which will show you details about the error.
Adding several remote machines¶
Several remote machines can be added to the Machine Manager. You can of course add new machines from scratch, but you can also export/import the settings of an existing machine and use those as a template for new machines.
Tip
The import/export functionality is very convenient for sharing machine settings within a group of researchers.
In the following, you will rename and export the settings of the newly created machine, and then add one more remote machine suitable for threaded calculations.
Rename and export¶
In the Job Manager, remove the jobs that are in the queue of the newly added machine (named “Sabalcore” in our example).
Click Machine Settings and choose to Edit the machine.
Note
You always need to empty the queue of a machine before you can edit its settings.
This machine is already set up for MPI parallelization. Therefore, append “: MPI” to the machine name, click Export, and save the settings in a file.
Click OK to accept the renaming and return to the Machine Manager window.
A machine for threaded jobs¶
Next, add a new remote machine with default settings for “Remote PBS”. Then click Import and import the settings file you just saved.
You can now modify the settings to create a machine with default settings that are suitable for a threaded calculation:
- In the Environment tab, remove the OMP_NUM_THREADS=1 export statement.
- In the Resources tab, use only a single MPI process but enable dynamic scheduling of MKL threads.
Give the new machine a reasonable name, e.g. “Sabalcore: Threading”, and click OK to add the machine to the Machine Manager. This machine should be convenient for submitting ATK-ForceField calculations.