Slurm

Slurm Overview

Slurm is a powerful open-source workload manager designed for managing and scheduling jobs on Linux clusters. It efficiently allocates computing resources and coordinates job execution in high-performance computing (HPC) environments, enabling users to run batch and parallel tasks at scale.

The Discovery Cluster utilizes Slurm to manage and schedule computing jobs. Slurm allocates resources such as CPUs and memory to ensure tasks execute efficiently and fairly across the cluster’s nodes. You can view the different partitions available on the Discovery Cluster in the Slurm Partition Overview.

Common user commands in Slurm include:

Command	Syntax	Description
sbatch	`sbatch <job script>`	Submit a batch job to the queue
squeue	`squeue`	Show status of Slurm batch jobs
srun	`srun <command>`	Run a command as a job step or start an interactive job
sinfo	`sinfo`	Show information about partitions
scontrol	`scontrol show job <JOBID>`	Used to check the status of a running or idle job
scancel	`scancel <JOBID>`	Cancel job

Batch jobs

To run a job in batch mode, first prepare a job script that specifies the application you want to launch and the resources required to run it. Then, use the sbatch command to submit your job script to Slurm.

For complete documentation about the sbatch command and its options, see the sbatch manual page via: man sbatch

Example submit script:

Slurm job scripts most commonly have at least one executable line preceded by a list of options that specify the resources and attributes needed to run your job (for example, wall-clock time, the number of nodes and processors, and filenames for job output and errors).

A job script for running a batch job on Discovery may look similar to the following:

#!/bin/bash
# Name of the job
#SBATCH --job-name=my_first_slurm_job

# Number of compute nodes
#SBATCH --nodes=1

# Number of tasks per node
#SBATCH --ntasks-per-node=1

# Number of CPUs per task
#SBATCH --cpus-per-task=1

# Request memory
#SBATCH --mem=8G

# Walltime (job duration)
#SBATCH --time=00:15:00

# Email notifications (comma-separated options: BEGIN,END,FAIL)
#SBATCH --mail-type=FAIL

module load module_name
./my_program arg1 arg2

In the above example:

The first line indicates that the script should be read using the Bash command interpreter.
The next lines are #SBATCH directives used to pass options to the sbatch command:
- --job-name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the system.
- --nodes=1 requests that one node be allocated to this job.
- --ntasks-per-node=1 specifies that one task should be launched per node.
- --cpus-per-task=1 specifies that one CPU should be allocated per task.
- --mem=8G requests 8 GB of memory per node.
- --time=00:15:00 sets a wall-clock limit of 15 minutes for the job.
- --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include ALL, BEGIN, END, and FAIL.
- -o filename_%j.txt and -e filename_%j.err (not used in the example above) instruct Slurm to connect the job’s standard output and standard error, respectively, to the file names specified, where %j is automatically replaced by the job ID.
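As a small illustration of the %j placeholder, the following sketch mimics the substitution with sed. Slurm performs this expansion itself when the job starts; the job ID 4311 and the file name pattern are placeholders chosen for the example.

```shell
#!/bin/bash
# Illustration only: Slurm expands %j on its own; sed merely imitates it here.
job_id=4311                   # hypothetical job ID
pattern="filename_%j.txt"     # value that would be passed to -o in a job script

# Replace every %j in the pattern with the job ID.
output_file=$(printf '%s\n' "$pattern" | sed "s/%j/${job_id}/g")
echo "$output_file"
```

Because the job ID is unique, each run of the same script writes to its own output file instead of overwriting an earlier one.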

A job script for running a batch job on the GPU nodes should contain --partition=gpuq and the --gres flag to indicate the type of GPU (k80 or v100) and the number of GPUs (1 or 4) to be allocated for the job. For example:

#!/bin/bash

#SBATCH -J job_name
#SBATCH --partition gpuq
#SBATCH --gres=gpu:k80:2
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00

module load module_name
./my_program my_program_arguments

In your script, replace my_program and my_program_arguments with your program’s name and any needed arguments.
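To confirm from inside a GPU job how many devices were granted, a check along the following lines could be added before the program starts. This is a sketch that assumes Slurm exports CUDA_VISIBLE_DEVICES as a comma-separated list of device indices, which depends on how GPUs are configured on the cluster; count_visible_gpus is a hypothetical helper name.

```shell
#!/bin/bash
# Count the GPUs visible to this job.
# Assumption: Slurm sets CUDA_VISIBLE_DEVICES to a comma-separated list
# such as "0,1" for a job that requested two GPUs.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        # Variable unset or empty: no GPUs are visible.
        echo 0
        return
    fi
    # Split on commas and print the number of fields.
    printf '%s\n' "$devices" | awk -F',' '{ print NF }'
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```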

Depending on the resources needed to run your executable lines, you may need to include other sbatch options in your job script. Here are a few other useful ones:

Option	Action
`--begin=YYYY-MM-DDTHH:MM:SS`	Defer allocation of your job until the specified date and time, after which the job is eligible to execute. For example, to defer allocation of your job until 10:30pm June 14, 2021, use: `--begin=2021-06-14T22:30:00`
`--no-requeue`	Specify that the job is not rerunnable. Setting this option prevents the job from being requeued after it has been interrupted, for example, by a scheduled downtime or preemption by a higher priority job.
`--export=ALL`	Export all environment variables in the `sbatch` command’s environment to the batch job.
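These options can also be given on the sbatch command line instead of in the script. The sketch below builds a --begin timestamp for 22:30 tomorrow and prints the resulting command rather than running it; it assumes GNU date (the default on Linux) and uses my_job.script as a placeholder file name.

```shell
#!/bin/bash
# Build a timestamp in the YYYY-MM-DDTHH:MM:SS form that --begin expects.
# Assumption: GNU date is available (its -d option parses "tomorrow 22:30").
begin_time=$(date -d "tomorrow 22:30" +%Y-%m-%dT%H:%M:%S)

# Print the command for inspection; remove "echo" to actually submit.
echo sbatch --begin="${begin_time}" --no-requeue my_job.script
```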

Submit your job script

To submit your job script (for example, my_job.script), use the sbatch command. If the command runs successfully, it will return a job ID to standard output; for example, on Discovery:

$ sbatch my_job.script
Submitted batch job 4311
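If a wrapper script needs the job ID later (for example, to pass to squeue or scancel), it can be pulled out of the confirmation message. The following is a minimal sketch; extract_job_id is a hypothetical helper, and the message text is taken from the example above.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's confirmation line.
extract_job_id() {
    # The job ID is the last whitespace-separated field of the message.
    awk '{ print $NF }' <<< "$1"
}

job_id=$(extract_job_id "Submitted batch job 4311")
echo "$job_id"

# On the cluster (not run here), the same helper would be used as:
#   job_id=$(extract_job_id "$(sbatch my_job.script)")
```

Alternatively, sbatch --parsable prints the job ID without the surrounding text, so no parsing is needed on a single-cluster setup.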

MPI jobs

To run an MPI job, add #SBATCH directives to your script for requesting the required resources and add the srun command as an executable line for launching your application. For example, a job script for running an MPI job that launches 96 tasks across two nodes in the general partition on Discovery could look similar to the following:

#!/bin/bash
  
#SBATCH -J mpi_job
#SBATCH -o mpi_%j.txt
#SBATCH -e mpi_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=00:30:00

cd /directory/with/stuff
srun my_program my_program_arguments

In your script, replace my_program and my_program_arguments with your program’s name and any needed arguments.
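The task count in this example follows from the two directives: the number of nodes multiplied by the tasks per node. A short sketch of that arithmetic, using the values from the script above, is shown here. The SLURM_NTASKS variable is typically exported by Slurm inside a job; outside a job the fallback text is printed.

```shell
#!/bin/bash
# Values taken from the #SBATCH directives in the MPI example.
nodes=2
ntasks_per_node=48

# Total MPI tasks that srun launches for this request.
total_tasks=$(( nodes * ntasks_per_node ))
echo "Requested tasks: ${total_tasks}"

# Inside a running job, Slurm typically reports the task count here.
echo "Allocated tasks: ${SLURM_NTASKS:-not inside a Slurm job}"
```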

OpenMP and hybrid OpenMP-MPI jobs

To run an OpenMP or hybrid OpenMP-MPI job, use the srun command and add the necessary #SBATCH directives as in the previous example, but also add an executable line that sets the OMP_NUM_THREADS environment variable to indicate the number of threads that should be used for parallel regions. For example, a job script for running a hybrid OpenMP-MPI job that launches 32 tasks (16 per node) across two nodes in the standard partition on Discovery could look similar to the following:

#!/bin/bash

#SBATCH -J hybrid_job
#SBATCH -o hybrid_%j.txt
#SBATCH -e hybrid_%j.err
#SBATCH --mail-type=ALL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:05:00
  
export OMP_NUM_THREADS=2
cd /directory/with/stuff
srun my_program my_program_arguments

In your script, replace my_program and my_program_arguments with your program’s name and any needed arguments.
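Instead of hard-coding the thread count, the script can derive it from the CPUs requested per task. The sketch below assumes the job was submitted with --cpus-per-task, in which case Slurm sets SLURM_CPUS_PER_TASK; when the variable is unset, it falls back to a single thread.

```shell
#!/bin/bash
# Match the OpenMP thread count to the CPUs allocated to each task.
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested,
# so default to 1 thread otherwise.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Keeping the two values in sync avoids running more threads than the task has CPUs for.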

You also can bind tasks to CPUs with the srun command’s --cpu-bind option. For example, to modify the previous example so that it binds tasks to sockets, add the --cpu-bind=sockets option to the srun command:

#!/bin/bash
  
#SBATCH -J hybrid_job
#SBATCH -o hybrid_%j.txt
#SBATCH -e hybrid_%j.err
#SBATCH --mail-type=ALL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=8
#SBATCH --time=00:05:00

export OMP_NUM_THREADS=2
cd /directory/with/stuff
srun --cpu-bind=sockets my_program my_program_arguments

In your script, replace my_program and my_program_arguments with your program’s name and any needed arguments.

Supported binding options include --cpu-bind=mask_cpu:<list>, which binds by setting CPU masks on tasks as indicated in the specified list. To view all available CPU bind options, on the Discovery command line, enter:

$ srun --cpu-bind=help

Interactive jobs

To request resources for an interactive job, use the srun command with the --pty option.

For example:

$ srun --pty /bin/bash
$ hostname
p04.hpcc.dartmouth.edu

Jobs submitted with srun --pty /bin/bash are assigned the cluster default values of 1 CPU and 1024 MB of memory. The account must also be specified; otherwise, the job will not run. If additional resources are required, they can be requested as options to the srun command.

The following example job is assigned 2 nodes, each with 4 CPUs (4 tasks with 1 CPU per task) and 4 GB of memory (1 GB per CPU):

$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --pty /bin/bash
[q06 ~]$ 

When the requested resources are allocated to your job, you will be placed at the command prompt within a cluster compute node. Once you are placed on a compute node, you can begin executing your code interactively.

Note: When you are finished with your interactive session, on the command line, enter exit to free the allocated resources.
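The per-node totals for an srun request follow from its options: tasks per node times CPUs per task gives the CPUs per node, and that figure times the memory per CPU gives the memory per node. A short sketch of the arithmetic, using the option values from the two-node srun example above:

```shell
#!/bin/bash
# Option values from the two-node interactive srun example.
ntasks_per_node=4
cpus_per_task=1
mem_per_cpu_gb=1

# CPUs per node = tasks per node x CPUs per task.
cpus_per_node=$(( ntasks_per_node * cpus_per_task ))
# Memory per node = CPUs per node x memory per CPU.
mem_per_node_gb=$(( cpus_per_node * mem_per_cpu_gb ))

echo "CPUs per node: ${cpus_per_node}"
echo "Memory per node: ${mem_per_node_gb} GB"
```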

For complete documentation about the srun command, see the srun manual page via: man srun

Monitor or delete your job

To monitor the status of jobs in a Slurm partition, use the squeue command.

Some useful squeue options include:

Option	Description
`-a`	Display information for all jobs.
`-j <jobid>`	Display information for the specified job ID.
`-j <jobid> -o %all`	Display all information fields (with a vertical bar separating each field) for the specified job ID.
`-l`	Display information in long format.
`-n <job_name>`	Display information for the specified job name.
`-p <partition_name>`	Display jobs in the specified partition.
`-t <state_list>`	Display jobs that have the specified state(s). Valid job states include: PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED, TIMEOUT, NODE_FAIL, PREEMPTED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, COMPLETING, CONFIGURING, RESIZING, REVOKED, and SPECIAL_EXIT.
`-u <username>`	Display jobs owned by the specified user.

For complete documentation about the squeue command, see the squeue manual page via: man squeue
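The -j option can be combined with -h (no header) to wait for a job in a script. The following is a minimal sketch, not a site-provided tool: wait_for_job is a hypothetical helper, and the 30-second default polling interval is an arbitrary choice.

```shell
#!/bin/bash
# Block until the given job no longer appears in the queue.
# Usage: wait_for_job <jobid> [poll_interval_seconds]
wait_for_job() {
    local job_id="$1"
    local interval="${2:-30}"
    # With -h the header is suppressed, so empty output means the job has
    # left the queue. Errors (for example, an ID Slurm no longer knows
    # about) also produce empty output and end the loop.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "Job ${job_id} is no longer in the queue"
}
```

A call such as wait_for_job 4632 60 would check once a minute. Leaving the queue does not imply success, so the job's output files still need to be checked.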

To delete your pending or running job, use the scancel command with your job’s job ID; for example, to delete your job that has a job ID of 4632, on the command line, enter:

$ scancel 4632

Alternatively:

To cancel a job named my_first_job, enter:

$ scancel -n my_first_job

To cancel all jobs owned by username, enter:

$ scancel -u username

For complete documentation about the scancel command, see the scancel manual page via: man scancel

View partition and compute node information

To view information about the nodes and partitions that Slurm manages, use the sinfo command.

By default, sinfo (without any options) displays:

All partition names
Availability of each partition
Maximum wall time allowed for jobs in each partition
Number of compute nodes in each partition
State of the compute nodes in each partition
Names of the compute nodes in each partition

To display node-specific information, use sinfo -N, which will list:

All node names
Partition to which each node belongs
State of each node

To display additional node-specific information, use sinfo -lN, which adds the following fields to the previous output:

Number of cores per node
Number of sockets per node, cores per socket, and threads per core
Size of memory per node in megabytes

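To choose which fields sinfo displays, use the -o option with a format string built from specifications such as the following:

Specification	Field displayed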
`%<#>P`	Partition name (set field width to # characters)
`%<#>N`	List of node names (set field width to # characters)
`%<#>c`	Number of cores per node (set field width to # characters)
`%<#>m`	Size of memory per node in megabytes (set field width to # characters)
`%<#>l`	Maximum wall time allowed (set field width to # characters)
`%<#>s`	Maximum number of nodes allowed per job (set field width to # characters)
`%<#>G`	Generic resource associated with a node (set field width to # characters)

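For example, to list each node with its partition, CPU count, memory, time limit, and generic resources, enter:

$ sinfo -No "%10P %8N  %4c  %7m  %12l %10G"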

The resulting output looks similar to this:

PARTITION  NODELIST  CPUS  MEMORY   TIMELIMIT    GRES
gpuq       g08       16    128640   infinite     gpu:k80:4(
gpuq       g10       16    128640   infinite     gpu:k80:4(
gpuq       g11       16    128640   infinite     gpu:k80:4(
bigmem     k25       16    64132    infinite     (null)
bigmem     k26       16    64132    infinite     (null)
bigmem     k27       16    64132    infinite     (null)
bigmem     k28       16    64132    infinite     (null)
bigmem     k29       16    64132    infinite     (null)
bigmem     k30       16    64132    infinite     (null)
bigmem     k31       16    64132    infinite     (null)
bigmem     k32       16    64132    infinite     (null)
bigmem     k33       16    64132    infinite     (null)
bigmem     k34       16    64132    infinite     (null)
bigmem     k35       16    64132    infinite     (null)
bigmem     k36       16    64132    infinite     (null)
bigmem     k37       16    64132    infinite     (null)
bigmem     k38       16    64132    infinite     (null)
bigmem     k39       16    64132    infinite     (null)
bigmem     k40       16    64132    infinite     (null)
bigmem     k41       16    64132    infinite     (null)
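Output in this form is easy to summarize with standard text tools. As a sketch, the following counts nodes per partition; count_nodes_per_partition is a hypothetical helper, and the sample input is a shortened copy of the listing above.

```shell
#!/bin/bash
# Count nodes per partition from "sinfo -N" style output read on stdin.
count_nodes_per_partition() {
    # Skip the header line, tally the first column (partition name),
    # and sort so the result order is stable.
    awk 'NR > 1 { count[$1]++ } END { for (p in count) print p, count[p] }' | sort
}

# Sample input: a shortened copy of the sinfo output shown above.
count_nodes_per_partition <<'EOF'
PARTITION  NODELIST  CPUS  MEMORY   TIMELIMIT    GRES
gpuq       g08       16    128640   infinite     gpu:k80:4(
gpuq       g10       16    128640   infinite     gpu:k80:4(
bigmem     k25       16    64132    infinite     (null)
EOF
```

On the cluster, the same function could read live data with sinfo -No "%10P %8N" | count_nodes_per_partition.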

For complete documentation about the sinfo command, see the sinfo manual page via: man sinfo

Credit https://kb.iu.edu/d/awrz
