Slurm
Slurm is a powerful open-source workload manager designed for managing and scheduling jobs on Linux clusters. It efficiently allocates computing resources and coordinates job execution in high-performance computing (HPC) environments, enabling users to run batch and parallel tasks at scale.
The Discovery Cluster utilizes Slurm to manage and schedule computing jobs. Slurm allocates resources such as CPUs and memory to ensure tasks execute efficiently and fairly across the cluster’s nodes. You can view the different partitions available on the Discovery Cluster in the Slurm Partition Overview.
Common user commands in Slurm include:
Command | Syntax | Description |
---|---|---|
sbatch | sbatch <job_script> | Submit a batch job to the queue |
squeue | squeue | Show the status of Slurm batch jobs |
srun | srun --pty /bin/bash | Run an interactive job |
sinfo | sinfo | Show information about partitions |
scontrol | scontrol show job <job_id> | Check the status of a running or idle job |
scancel | scancel <job_id> | Cancel a job |
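As a small illustration of the scontrol row above, the sketch below pulls the job state out of scontrol show job output, which is printed as key=value pairs. The helper name job_state and the sample input line are assumptions made for illustration, not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: read `scontrol show job` output on standard input
# and print the value of its JobState field.
job_state() {
    grep -o 'JobState=[A-Z_]*' | head -n 1 | cut -d= -f2
}

# Illustrative input line in the key=value style that scontrol prints.
echo "   JobState=RUNNING Reason=None Dependency=(null)" | job_state
```

On the cluster, the same helper could be fed real output, for example: scontrol show job <job_id> | job_state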
Batch jobs
To run a job in batch mode, first prepare a job script that specifies the application you want to launch and the resources required to run it. Then, use the sbatch command to submit your job script to Slurm.
For complete documentation about the sbatch command and its options, see the sbatch manual page via: man sbatch
Example submit script: Slurm job scripts most commonly have at least one executable line preceded by a list of options that specify the resources and attributes needed to run your job (for example, wall-clock time, the number of nodes and processors, and filenames for job output and errors).
A job script for running a batch job on Discovery may look similar to the following:
#!/bin/bash
# Name of the job
#SBATCH --job-name=my_first_slurm_job
# Number of compute nodes
#SBATCH --nodes=1
# Number of tasks per node
#SBATCH --ntasks-per-node=1
# Number of CPUs per task
#SBATCH --cpus-per-task=1
# Request memory
#SBATCH --mem=8G
# Walltime (job duration)
#SBATCH --time=00:15:00
# Email notifications (comma-separated options: BEGIN,END,FAIL)
#SBATCH --mail-type=FAIL
module load module_name
./my_program arg1 arg2
In the above example:
The first line indicates that the script should be read using the Bash command interpreter.
The next lines are #SBATCH directives used to pass options to the sbatch command:
--job-name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the system.
-o filename_%j.txt and -e filename_%j.err (not set in the script above) instruct Slurm to connect the job’s standard output and standard error, respectively, to the file names specified, where %j is automatically replaced by the job ID.
--mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include all, begin, end, and fail.
--nodes=1 requests one node be allocated to this job.
--ntasks-per-node=1 specifies that one task should be launched per node.
--cpus-per-task=1 specifies that one CPU should be allocated per task.
--mem=8G requests 8 GB of memory per node.
--time=00:15:00 sets a wall-clock limit of 15 minutes for the job.
A job script for running a batch job on the GPU nodes should contain --partition=gpuq and the --gres flag to indicate the type of GPU (k80 or v100) and the number of GPUs (1 or 4) to be allocated for the job. For example:
#!/bin/bash
#SBATCH -J job_name
#SBATCH --partition gpuq
#SBATCH --gres=gpu:k80:2
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00
module load module_name
./my_program my_program_arguments
In your script, replace my_program and my_program_arguments with your program’s name and any needed arguments.
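Inside a GPU job it can be useful to confirm how many GPUs the job actually sees before launching the program. The sketch below assumes that Slurm exports CUDA_VISIBLE_DEVICES as a comma-separated list for jobs that request --gres=gpu, which is common but depends on the site configuration; count_visible_gpus is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumption: the variable holds a comma-separated list such as "0,1".
count_visible_gpus() {
    devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        # Variable unset or empty: report zero GPUs.
        echo 0
        return
    fi
    # The number of comma-separated fields is the number of GPUs.
    echo "$devices" | awk -F',' '{ print NF }'
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

For the --gres=gpu:k80:2 request shown above, a line like this placed before ./my_program would be expected to report 2 if the assumption holds.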
Depending on the resources needed to run your executable lines, you may need to include other sbatch options in your job script. Here are a few other useful ones:
Option | Action |
---|---|
--begin=YYYY-MM-DDTHH:MM:SS | Defer allocation of your job until the specified date and time, after which the job is eligible to execute. For example, to defer allocation of your job until 10:30pm June 14, 2021, use: --begin=2021-06-14T22:30:00 |
--no-requeue | Specify that the job is not rerunnable. Setting this option prevents the job from being requeued after it has been interrupted, for example, by a scheduled downtime or preemption by a higher priority job. |
--export=ALL | Export all environment variables in the sbatch command’s environment to the batch job. |
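The timestamp format expected by --begin can be produced with the date command rather than typed by hand. The following is a minimal sketch, assuming GNU date (for its -d option); begin_timestamp is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: format a date/time description as YYYY-MM-DDTHH:MM:SS,
# the form used by the --begin option. Assumes GNU date.
begin_timestamp() {
    date -d "$1" +%Y-%m-%dT%H:%M:%S
}

# Reproduce the deferred start time used in the table above.
begin_timestamp "2021-06-14 22:30"
```

The result could then be passed on the command line, for example: sbatch --begin="$(begin_timestamp 'tomorrow 08:00')" my_job.script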
Submit your job script
To submit your job script (for example, my_job.script), use the sbatch command. If the command runs successfully, it will return a job ID to standard output; for example, on Discovery:
$ sbatch my_job.script
Submitted batch job 4311
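When a wrapper script needs the job ID for later squeue or scancel calls, it can be extracted from the confirmation line shown above. The sketch below uses a hypothetical helper, parse_job_id, and assumes the message keeps the "Submitted batch job <id>" form.

```shell
#!/bin/bash
# Hypothetical helper: return the job ID from sbatch's confirmation line.
# Assumption: the ID is the last whitespace-separated field of the message.
parse_job_id() {
    echo "$1" | awk '{ print $NF }'
}

job_id=$(parse_job_id "Submitted batch job 4311")
echo "$job_id"
```

In practice the argument would be the output of the submission itself, for example parse_job_id "$(sbatch my_job.script)". sbatch also offers a --parsable option that prints the job ID without the surrounding text.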
To run an MPI job, add #SBATCH directives to your script for requesting the required resources and add the srun command as an executable line for launching your application. For example, a job script for running an MPI job that launches 96 tasks across two nodes in the general partition on Discovery could look similar to the following:
#!/bin/bash
#SBATCH -J mpi_job
#SBATCH -o mpi_%j.txt
#SBATCH -e mpi_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=00:30:00
cd /directory/with/stuff
srun my_program my_program_arguments
In your script, replace my_program and my_program_arguments with your program’s name and any needed arguments.
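To confirm that the allocation matches the request (2 nodes and 96 tasks in the script above), a job can log what Slurm actually granted before calling srun. This sketch reads the SLURM_JOB_NUM_NODES and SLURM_NTASKS environment variables that Slurm sets inside a job; describe_allocation is a hypothetical helper, and the "unknown" fallback covers the case where a variable is not set.

```shell
#!/bin/bash
# Hypothetical helper: summarize the allocation from Slurm's environment.
# Falls back to "unknown" when a variable is not set (e.g. outside a job).
describe_allocation() {
    nodes="${SLURM_JOB_NUM_NODES:-unknown}"
    tasks="${SLURM_NTASKS:-unknown}"
    echo "nodes=${nodes} tasks=${tasks}"
}

describe_allocation
```

Placed just before the srun line, this writes a one-line summary to the job's standard output file.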
OpenMP and hybrid OpenMP-MPI jobs
To run an OpenMP or hybrid OpenMP-MPI job, use the srun command and add the necessary #SBATCH directives as in the previous example, but also add an executable line that sets the OMP_NUM_THREADS environment variable to indicate the number of threads that should be used for parallel regions. For example, a job script for running a hybrid OpenMP-MPI job that launches 16 tasks on each of two nodes (32 tasks in total) in the standard partition on Discovery could look similar to the following:
#!/bin/bash
#SBATCH -J hybrid_job
#SBATCH -o hybrid_%j.txt
#SBATCH -e hybrid_%j.err
#SBATCH --mail-type=ALL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:05:00
export OMP_NUM_THREADS=2
cd /directory/with/stuff
srun my_program my_program_arguments
In your script, replace my_program and my_program_arguments with your program’s name and any needed arguments.
You can also bind tasks to CPUs with the srun command’s --cpu-bind option. For example, to modify the previous example so that it binds tasks to sockets, add the --cpu-bind=sockets option to the srun command (the script below also requests 8 CPUs per task with --cpus-per-task=8):
#!/bin/bash
#SBATCH -J hybrid_job
#SBATCH -o hybrid_%j.txt
#SBATCH -e hybrid_%j.err
#SBATCH --mail-type=ALL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=8
#SBATCH --time=00:05:00
export OMP_NUM_THREADS=2
cd /directory/with/stuff
srun --cpu-bind=sockets my_program my_program_arguments
In your script, replace my_program and my_program_arguments with your program’s name and any needed arguments.
Supported binding options include --cpu-bind=mask_cpu:<list>, which binds by setting CPU masks on tasks as indicated in the specified list. To view all available CPU bind options, on the Discovery command line, enter:
$ srun --cpu-bind=help
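The hybrid examples above hard-code OMP_NUM_THREADS. One alternative, sketched below, is to derive the thread count from SLURM_CPUS_PER_TASK so that it follows the --cpus-per-task request. Slurm sets that variable only when --cpus-per-task is given, so the sketch falls back to 1 thread otherwise; set_omp_threads is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: match the OpenMP thread count to the CPUs that
# Slurm allocated per task, defaulting to 1 when the variable is unset.
set_omp_threads() {
    OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
    export OMP_NUM_THREADS
}

set_omp_threads
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Note that this changes behavior relative to the scripts above: with --cpus-per-task=8 it would run 8 threads per task instead of 2.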
To request resources for an interactive job, use the srun command with the --pty option.
For example:
$ srun --pty /bin/bash
$ hostname
p04.hpcc.dartmouth.edu
Jobs submitted with srun --pty /bin/bash will be assigned the cluster default values of 1 CPU and 1024 MB of memory. An account must also be specified; otherwise the job will not run. If additional resources are required, they can be requested as options to the srun command.
The following example job is assigned 2 nodes with 4 CPUs and 4 GB of memory each:
$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --pty /bin/bash
[q06 ~]$
When the requested resources are allocated to your job, you will be placed at the command prompt within a cluster compute node. Once you are placed on a compute node, you can begin executing your code interactively.
Note:
When you are finished with your interactive session, on the command line, enter exit to free the allocated resources.
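The per-node totals implied by the two-node srun example above follow from simple arithmetic on its options. The sketch below works through it; the variable names are illustrative and simply mirror the option values from that command.

```shell
#!/bin/bash
# Values taken from the srun example:
# --nodes=2 --ntasks-per-node=4 --cpus-per-task=1 --mem-per-cpu=1GB
nodes=2
ntasks_per_node=4
cpus_per_task=1
mem_per_cpu_gb=1

# CPUs on each node = tasks per node x CPUs per task.
cpus_per_node=$(( ntasks_per_node * cpus_per_task ))
# Memory on each node = CPUs per node x memory per CPU.
mem_per_node_gb=$(( cpus_per_node * mem_per_cpu_gb ))
# CPUs across the whole allocation.
total_cpus=$(( nodes * cpus_per_node ))

echo "CPUs per node: ${cpus_per_node}"
echo "Memory per node (GB): ${mem_per_node_gb}"
echo "Total CPUs: ${total_cpus}"
```

This gives 4 CPUs and 4 GB on each node, or 8 CPUs across the allocation.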
For complete documentation about the srun command, see the srun manual page via: man srun
Monitor or delete your job
To monitor the status of jobs in a Slurm partition, use the squeue command.
Some useful squeue options include:
Option | Description |
---|---|
-a | Display information for all jobs. |
-j <jobid> | Display information for the specified job ID. |
-j <jobid> -o %all | Display all information fields (with a vertical bar separating each field) for the specified job ID. |
-l | Display information in long format. |
-n <job_name> | Display information for the specified job name. |
-p <partition_name> | Display jobs in the specified partition. |
-t <state_list> | Display jobs that have the specified state(s). Valid job states include: PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED, TIMEOUT, NODE_FAIL, PREEMPTED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, COMPLETING, CONFIGURING, RESIZING, REVOKED, and SPECIAL_EXIT. |
-u <username> | Display jobs owned by the specified user. |
For complete documentation about the squeue command, see the squeue manual page via: man squeue
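The options in the table can be combined. As a sketch, the hypothetical helper below lists one user's pending and running jobs in long format; it only calls squeue when the command is available, so it can be read (and tried) outside the cluster without error.

```shell
#!/bin/bash
# Hypothetical helper: show pending and running jobs for a user in long format.
# Defaults to the current user when no username is given.
my_active_jobs() {
    squeue -u "${1:-$USER}" -t PENDING,RUNNING -l
}

# Only run the query where squeue is actually installed.
if command -v squeue >/dev/null 2>&1; then
    my_active_jobs
fi
```

Calling my_active_jobs <username> restricts the listing to that user instead.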
To delete your pending or running job, use the scancel command with your job’s job ID; for example, to delete your job that has a job ID of 4632, on the command line, enter:
$ scancel 4632
Alternatively:
To cancel a job named my_first_job, enter:
$ scancel -n my_first_job
To cancel a job owned by username, enter:
$ scancel -u username
For complete documentation about the scancel command, see the scancel manual page via: man scancel
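Monitoring can also be scripted. The sketch below polls squeue until a job no longer appears in the queue, using the -h option to suppress the header so that empty output means the job has left the queue. wait_for_job and its default 30-second interval are assumptions for illustration; a polling interval that is too short puts unnecessary load on the scheduler.

```shell
#!/bin/bash
# Hypothetical helper: block until the given job ID is no longer listed by squeue.
# Usage: wait_for_job <job_id> [poll_interval_seconds]
wait_for_job() {
    job_id="$1"
    interval="${2:-30}"
    # squeue -h prints no header, so any output means the job is still queued
    # or running; errors for unknown job IDs are discarded.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}
```

For example, wait_for_job 4632 60 would check once a minute and return after job 4632 has finished or been cancelled.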
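To view information about the nodes and partitions that Slurm manages, use the sinfo command.
By default, sinfo (without any options) displays, for each partition, its name, availability, maximum wall time, number of nodes, node state, and node names.
To display node-specific information, use sinfo -N, which lists each node’s name, the partition to which it belongs, and its state.
To display additional node-specific information, use sinfo -lN, which adds fields such as the number of cores per node, the socket, core, and thread layout, and the size of memory per node in megabytes.
To choose which fields are displayed and to set the width of each column, use sinfo with the -o option and the following field specifications (replace # with the field width in characters):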
Specification | Field displayed |
---|---|
%<#>P | Partition name (set field width to # characters) |
%<#>N | List of node names (set field width to # characters) |
%<#>c | Number of cores per node (set field width to # characters) |
%<#>m | Size of memory per node in megabytes (set field width to # characters) |
%<#>l | Maximum wall time allowed (set field width to # characters) |
%<#>s | Maximum number of nodes allowed per job (set field width to # characters) |
%<#>G | Generic resource associated with a node (set field width to # characters) |
For example, to display the partition name, node name, number of cores, memory size, maximum wall time, and generic resources for each node, enter:
$ sinfo -No "%10P %8N %4c %7m %12l %10G"
The resulting output looks similar to this:
PARTITION NODELIST CPUS MEMORY TIMELIMIT GRES
gpuq g08 16 128640 infinite gpu:k80:4(
gpuq g10 16 128640 infinite gpu:k80:4(
gpuq g11 16 128640 infinite gpu:k80:4(
bigmem k25 16 64132 infinite (null)
bigmem k26 16 64132 infinite (null)
bigmem k27 16 64132 infinite (null)
bigmem k28 16 64132 infinite (null)
bigmem k29 16 64132 infinite (null)
bigmem k30 16 64132 infinite (null)
bigmem k31 16 64132 infinite (null)
bigmem k32 16 64132 infinite (null)
bigmem k33 16 64132 infinite (null)
bigmem k34 16 64132 infinite (null)
bigmem k35 16 64132 infinite (null)
bigmem k36 16 64132 infinite (null)
bigmem k37 16 64132 infinite (null)
bigmem k38 16 64132 infinite (null)
bigmem k39 16 64132 infinite (null)
bigmem k40 16 64132 infinite (null)
bigmem k41 16 64132 infinite (null)
For complete documentation about the sinfo command, see the sinfo manual page via: man sinfo
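Node listings like the one above can be summarized with standard text tools. The sketch below counts nodes per partition from lines whose first column is the partition name; count_nodes_by_partition is a hypothetical helper, and the sample input is a shortened copy of the output shown earlier.

```shell
#!/bin/bash
# Hypothetical helper: read node lines on standard input (partition name in
# the first column, no header line) and print "<partition> <node count>".
count_nodes_by_partition() {
    awk '{ count[$1]++ } END { for (p in count) print p, count[p] }' | sort
}

# Shortened sample of the sinfo output shown above.
count_nodes_by_partition <<'EOF'
gpuq g08 16 128640 infinite gpu:k80:4(
gpuq g10 16 128640 infinite gpu:k80:4(
bigmem k25 16 64132 infinite (null)
EOF
```

On the cluster, the input could come directly from sinfo with the header suppressed by -h, for example: sinfo -h -No "%10P %8N" | count_nodes_by_partition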
Credit https://kb.iu.edu/d/awrz
Related Articles: Slurm Coordinator Role for Managing Users in an Account.