Slurm is the job scheduler we use in our cluster. More information about what a job scheduler is can be found in the Introduction. Here we go into more depth about some elements of the scheduler. Slurm has many more features that go beyond the scope of this guide, but everything you need to know as a user should be covered here.

Partitions / Queues

Our cluster has a number of Slurm partitions (also known as queues) defined:

  • cpu
  • gpu

As you may have guessed, you request a specific partition based on the resources your job needs. All of the above partitions have a default runtime limit of 4 hours. There is no maximum on how long a job can run, but it is your responsibility to set the job’s time limit (the -t option) to override the default value.
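As a minimal sketch of overriding the default limit, the batch script below requests two days on the cpu partition, using -t in days-hours:minutes:seconds form. The two-day value and the echo workload are illustrative assumptions, not site recommendations.

```shell
#!/bin/bash
#SBATCH -p cpu          # Partition
#SBATCH -t 2-00:00:00   # Time limit of 2 days, overriding the 4 hour default

# Placeholder workload (hypothetical); replace with your own command.
# SLURM_JOB_ID is set by Slurm inside a job; "unknown" is used anywhere else.
message="Job ${SLURM_JOB_ID:-unknown} started"
echo "$message"
```

If the job is still running when the limit is reached, Slurm will end it, so it is worth choosing a limit with some margin.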

To view all available partitions in Slurm, you can run the command sinfo.
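The default sinfo output can be narrowed with a format string. The sketch below wraps that in a small helper; the function name partition_limits is hypothetical, while -p and -o with the %P (partition), %l (time limit), and %D (node count) fields are standard sinfo options.

```shell
#!/bin/bash
# Show partition name, time limit, and node count.
# With an argument, restrict the output to that partition; otherwise show all.
partition_limits() {
    if [ -n "${1:-}" ]; then
        sinfo -p "$1" -o "%P %l %D"
    else
        sinfo -o "%P %l %D"
    fi
}

# Usage (on a login node):
#   partition_limits        # all partitions
#   partition_limits gpu    # only the gpu partition
```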

For those who purchase hardware for Unity and choose to have priority or sole access to a partition, that partition will be listed, but you will not be able to use it unless you are a member of that group.

Jobs

A job is an operation which the user submits to the cluster to run under allocated resources.

When submitting a job to the cluster, you have a few options. The first is to use srun, which simply runs the command you give it, with any options you specify.

The other option is an interactive job. This allocates resources and starts the job interactively, so that you can watch what is happening live and interact with the prompt. If you leave this prompt, the job will be cancelled. For example, you can use this feature together with the bash command to open an interactive bash shell with allocated resources.

Using sbatch to submit jobs

sbatch is a non-blocking command: it returns as soon as the job has been submitted and never waits for the job to run. Even if the requested resources are not available, the job is placed in the queue and starts once resources become available. The status of a job can be seen using squeue.
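The submit-then-check workflow might look like the sketch below. The helper name submit_and_check and the file name job.sh are hypothetical; --parsable (print only the job ID, optionally followed by a cluster name) and squeue -j are standard Slurm options.

```shell
#!/bin/bash
# Submit a batch file, then show its entry in the queue.
submit_and_check() {
    local batch_file="$1"
    local job_id
    # --parsable prints "<jobid>" or "<jobid>;<cluster>" instead of a sentence.
    job_id=$(sbatch --parsable "$batch_file") || return 1
    job_id="${job_id%%;*}"   # drop the cluster name, if present
    echo "Submitted job ${job_id}"
    # sbatch has already returned here; the job itself may still be pending.
    squeue -j "$job_id"
}

# Usage: submit_and_check job.sh
```

Because sbatch returns immediately, the squeue line may show the job as pending rather than running.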

sbatch is based around running a single file. You shouldn’t need to pass anything on the command line other than sbatch <batch file>, because all parameters can be specified inside the file itself.

The following is an example of a batch script. Please note that the script must start with #!/bin/bash (or whatever interpreter you need; if you don’t know, use bash), immediately followed by the #SBATCH <param> lines. The script below shows some common SBATCH parameters; it allocates 4 CPUs and one GPU in the gpu partition.

#!/bin/bash
#SBATCH -c 4  # Number of Cores per Task
#SBATCH --mem=8192  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 01:00:00  # Job time limit
#SBATCH -o slurm-%j.out  # %j = job ID

module load cuda/10
/modules/apps/cuda/10.1.243/samples/bin/x86_64/linux/release/deviceQuery

This script should query the available GPUs and, since only one GPU was requested, print only one device to the specified output file. Feel free to remove or modify any of the parameters in the script to suit your needs.

Usually, if you have to run a single application multiple times, or if you are trying to run a non-interactive application, you should use sbatch instead of srun, since sbatch allows you to specify parameters in the file and is non-blocking, whereas srun is blocking (see below).

Using srun to submit jobs

srun is a so-called blocking command: it will not let you execute other commands until it has finished (not necessarily the job, just the allocation). For example, if you run srun /bin/hostname and resources are available right away, the job will be sent out and, by default, its output printed to your terminal. If resources are not available, the command will hold while your job is pending in the queue.
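The blocking behaviour can be seen in a sketch like the following, where the final echo is not reached until srun has returned. The helper name run_and_wait is hypothetical, and the cpu partition is taken from the list earlier in this guide.

```shell
#!/bin/bash
# Run hostname on a compute node and report when srun gives control back.
run_and_wait() {
    echo "waiting for srun..."
    # The shell stays on this line while the job is pending and running.
    srun -p cpu /bin/hostname
    echo "srun returned with status $?"
}

# Usage: run_and_wait
```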

Please note that, as with sbatch, you can run a batch file using srun.

The command syntax is srun <options> [executable] <args>

Options are where you specify the resources you want for the executable. The following are some of the options available; to see all available parameters, run man srun.

  • -c <num> Number of CPUs (threads) to allocate to the job per task
  • -n <num> The number of tasks to allocate (for MPI)
  • -G <num> Number of GPUs to allocate to the job
  • --mem <num> Memory to allocate to the job (in MB by default)
  • -p <partition> Partition to submit the job to
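Combined, the options above might be assembled as in the sketch below. The values (4 CPUs, 8192 MB, one task, one GPU, the gpu partition) are illustrative assumptions, and the command is only printed, not run.

```shell
#!/bin/bash
# Illustrative resource request; adjust the values to suit your job.
cpus=4          # -c: CPUs per task
tasks=1         # -n: number of tasks
gpus=1          # -G: number of GPUs
mem_mb=8192     # --mem: memory, in MB by default
partition=gpu   # -p: partition to submit to

# Build the command as an array so each argument stays a separate word.
cmd=(srun -c "$cpus" -n "$tasks" -G "$gpus" --mem "$mem_mb" -p "$partition" /bin/hostname)

# Print the command for inspection; to actually run it, use: "${cmd[@]}"
echo "${cmd[*]}"
```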

To run an interactive job (in this case a bash prompt), the command might look like this (--pty is the important option):

srun -c 6 -p cpu --pty bash

To run an application on the cluster that uses a GUI, you must use an interactive job, in addition to the --x11 argument:

srun -c 6 -p cpu --pty --x11 xclock

You cannot run an interactive or GUI job using the sbatch command; you must use srun.