SBATCH

Discovery Job Submission

Single Core Job Example  

Sample Slurm Script to Submit a Single Processor Job.

Create a script file that includes the details of the job that you want to run.

It can include the name of the program, the memory, wall time and processor requirements of the job, which queue it should run in and how to notify you of the results of the job.

Here is an example submit script.

#!/bin/bash

# Name of the job
#SBATCH --job-name=single-core-test-job

# Number of compute nodes
#SBATCH --nodes=1

# Number of cores, in this case one
#SBATCH --ntasks-per-node=1

# Walltime (job duration)
#SBATCH --time=00:15:00

# Email notifications
#SBATCH --mail-type=BEGIN,END,FAIL

hostname
date
sleep 60

All lines that begin with #SBATCH are directives to Slurm. The meaning of each directive in the sample script is explained in the comment line that precedes it.

The full list of available directives is explained in the man page for the sbatch command, which is available on Discovery.

sbatch copies the current shell environment, and the scheduler recreates that environment on the allocated compute node when the job starts. The job script does NOT run .bashrc or .bash_profile, so it may not have the same environment as a fresh login shell. This matters if you use aliases, or if you use conda to set up your own custom version of Python and sets of Python packages. Because conda defines shell functions, it must be configured before you can call, e.g., conda activate my-env. The simplest way to do this is to make the first line of your script:

#!/bin/bash -l

This explicitly starts bash as a login shell.
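As a rough illustration of the point above, the sketch below reports whether the script is running in a login shell and only calls conda activate when conda is actually available. The job name conda-env-check and the shell_kind variable are made up for this example, my-env is the placeholder environment name used in the text, and whether conda becomes available depends on how your own start-up files are set up.

```shell
#!/bin/bash -l
#SBATCH --job-name=conda-env-check
#SBATCH --time=00:05:00

# Record whether bash was started as a login shell (the effect of "-l").
if shopt -q login_shell 2>/dev/null; then
    shell_kind="login"
else
    shell_kind="non-login"
fi
echo "Shell type: ${shell_kind}"

# "conda activate" relies on conda's shell function, which login start-up
# files normally define. "my-env" is a placeholder environment name.
if type conda >/dev/null 2>&1; then
    conda activate my-env || echo "Could not activate my-env" >&2
else
    echo "conda is not initialised in this shell" >&2
fi
```

If the script prints "non-login" when run as a job, the -l flag on the first line is probably missing.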

Now submit the job and check its status:

[user@discovery slurm]$ sbatch my_first_slurm.sh
Submitted batch job 4056

[user@discovery slurm]$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
4056  standard multicor     john  R       0:01      1 p04

[user@discovery slurm]$ scontrol show job 4056
JobId=4056 JobName=multicore_job
UserId=user(48374) GroupId=rc-users(480987) MCS_label=rc
Priority=4294901747 Nice=0 Account=rc QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:09 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2021-05-14T12:25:53 EligibleTime=2021-05-14T12:25:53
AccrueTime=2021-05-14T12:25:53
StartTime=2021-05-14T12:25:54 EndTime=2021-05-14T12:40:54 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-14T12:25:54
Partition=standard AllocNode:Sid=discovery7:21489
ReqNodeList=(null) ExcNodeList=(null)
NodeList=p04
BatchHost=p04
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/dartfs-hpc/rc/home/8/dz99918/xnode_tests/slurm/my_first_slurm.sh
WorkDir=/dartfs-hpc/rc/home/8/dz99918/xnode_tests/slurm
StdErr=/dartfs-hpc/rc/home/8/dz99918/xnode_tests/slurm/slurm-4056.out
StdIn=/dev/null
StdOut=/dartfs-hpc/rc/home/8/dz99918/xnode_tests/slurm/slurm-4056.out
Power=
MailUser=<email> MailType=BEGIN,END,FAIL
NtasksPerTRES:0

JOBID is the unique ID of the job; in this case it is 4056. The example above uses scontrol to view detailed information about the job.

The output file, slurm-4056.out, consists of three sections:

  • A header section, Prologue, which gives information such as JOBID, user name and node list.
  • A body section, which includes user output to STDOUT.
  • A footer section, Epilogue, which is similar to the header. A useful difference is the report of wallclock time towards the end.

Typically your job will create one file that joins STDOUT and STDERR. To have your job create separate files for STDOUT and STDERR, pass --output and --error. Here is an example:

#SBATCH --output=My_first_job-%x.%j.out
#SBATCH --error=My_first_job-%x.%j.err
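To make the filename patterns more concrete, here is a minimal sketch of a complete script that separates the two streams. In Slurm filename patterns, %x expands to the job name and %j to the job ID. The job name split_output_demo and the variables out_file and err_file are invented for illustration; the script only predicts the file names, and Slurm itself does the redirection.

```shell
#!/bin/bash
#SBATCH --job-name=split_output_demo
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --time=00:05:00

# Inside a job Slurm sets these variables; the fallbacks are placeholders
# so the script can also be tried outside Slurm.
job_name="${SLURM_JOB_NAME:-split_output_demo}"
job_id="${SLURM_JOB_ID:-0}"

out_file="${job_name}.${job_id}.out"
err_file="${job_name}.${job_id}.err"

echo "Regular output is expected in ${out_file}"
echo "Error output is expected in ${err_file}" >&2
```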

File Management In a Batch Queue System

Sometimes you may be running the same program in multiple jobs and you will need to be sure to keep your input and output files separate for each job.

One way to manage your data files is to have a separate directory for each job.

Copy the required input files to the directory and then edit the batch script file to include a line where you change to the directory that contains the input files.

cd /path/to/where/your/input/files/are

Place this line before the line where you issue the command to be run. By default your job files will be created in the directory that you submit from.
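One possible way to follow this advice is sketched below: each job gets a directory named after its job ID, and the script changes into it before running anything. The scratch_root location and the directory naming scheme are assumptions made for this example, not a site convention, and the commented-out commands are placeholders.

```shell
# Hypothetical layout: one working directory per job under a scratch area.
scratch_root="${TMPDIR:-/tmp}/my_project_runs"
job_dir="${scratch_root}/job_${SLURM_JOB_ID:-manual}"

mkdir -p "${job_dir}"
# cp /path/to/inputs/* "${job_dir}/"    # copy this job's input files here

if cd "${job_dir}"; then
    echo "Working directory: $(pwd)"
    # ./program_name input.dat > output.dat    # hypothetical command
else
    echo "Could not change to ${job_dir}" >&2
fi
```

Because the job ID is unique, two jobs running the same program should never write into the same directory.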

Multi-Core Job Example  

Below is an example script that requests 4 cores on a single compute node. Feel free to copy and paste it as a job template.

#!/bin/bash
#SBATCH --job-name=multicore_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:15:00
#SBATCH --mail-type=BEGIN,END,FAIL
mpirun -n 4 ./program_name <optional args>
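A small variation, shown here only as a sketch, avoids repeating the number 4 in two places by reading the task count from the environment. It assumes Slurm sets SLURM_NTASKS_PER_NODE because --ntasks-per-node is given, and that the job uses a single node so this equals the total number of ranks. The ranks variable is invented for the example, and ./program_name is the placeholder from the template above.

```shell
#!/bin/bash
#SBATCH --job-name=multicore_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:15:00

# With one node, the rank count equals the tasks-per-node request.
# The fallback of 4 only applies when the script runs outside Slurm.
ranks="${SLURM_NTASKS_PER_NODE:-4}"

if command -v mpirun >/dev/null 2>&1 && [ -x ./program_name ]; then
    mpirun -n "${ranks}" ./program_name
else
    echo "Would run: mpirun -n ${ranks} ./program_name"
fi
```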

When you are ready to submit the job, you can do so by issuing the sbatch command:

sbatch <jobscript>

For more information about job parameters, please take a look at:

Slurm Workload Manager (External)

GPU Job Example  

#!/bin/bash

# Name of the job
#SBATCH --job-name=gpu_job

# Number of compute nodes
#SBATCH --nodes=1

# Number of cores, in this case one
#SBATCH --ntasks-per-node=1

# Request the GPU partition
#SBATCH --partition=gpuq

# Request the GPU resources
#SBATCH --gres=gpu:2

# Walltime (job duration)
#SBATCH --time=00:15:00

# Email notifications
#SBATCH --mail-type=BEGIN,END,FAIL

nvidia-smi
echo $CUDA_VISIBLE_DEVICES
hostname

After you submit the job via sbatch, the output file shows the requested resources, as reported by the nvidia-smi command and the value of $CUDA_VISIBLE_DEVICES:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   33C    P0    39W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
0, 1
p04.hpcc.dartmouth.edu

Your program needs to know which GPU it has been assigned. The submission template above echoes $CUDA_VISIBLE_DEVICES to show which GPU numbers the job has been assigned. You should pass the GPU number to your program as a command line argument and then set the default GPU in your code.

./program_name $CUDA_VISIBLE_DEVICES
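When more than one GPU is allocated, $CUDA_VISIBLE_DEVICES holds a comma-separated list, so a program expecting a single number may need it split first. The sketch below shows one way to do that. The helper names first_gpu and count_gpus are hypothetical, and ./program_name is the same placeholder as above.

```shell
# Return the first ID from a comma-separated device list such as "0,1".
first_gpu() {
    printf '%s\n' "${1%%,*}"
}

# Count the IDs in a comma-separated device list.
count_gpus() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    gpu_commas=$(printf '%s' "$1" | tr -cd ',')
    echo $(( ${#gpu_commas} + 1 ))
}

gpu_list="${CUDA_VISIBLE_DEVICES:-}"
if [ -n "${gpu_list}" ]; then
    echo "Assigned $(count_gpus "${gpu_list}") GPU(s); first is $(first_gpu "${gpu_list}")"
    # ./program_name "$(first_gpu "${gpu_list}")"
else
    echo "CUDA_VISIBLE_DEVICES is not set; no GPU appears to be allocated" >&2
fi
```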

Available GPU types can be found with the command sinfo -O gres -p <partition>. GPUs can be requested in both Batch and Interactive jobs.

$ sinfo -O gres -p gpuq   
GRES
gpu:nvidia_a100_80gb

$ sinfo -O gres -p a5500
GRES
gpu:nvidia_rtx_a5500

Job Array example  

Job Arrays

Job arrays are multiple jobs to be executed with identical or related parameters. Job arrays are submitted with -a <indices> or --array=<indices>. The indices specification identifies which array index values should be used. Multiple values may be specified using a comma-separated list and/or a range of values with a "-" separator: --array=0-15 or --array=0,6,16-32.

A step function can also be specified with a suffix containing a colon and number. For example, --array=0-15:4 is equivalent to --array=0,4,8,12. A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example, --array=0-15%4 will limit the number of simultaneously running tasks from this job array to 4. The minimum index value is 0. The maximum value is 499999.
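To check what a range with a step expands to before submitting, a small helper like the one below can be useful. It is only an illustration of the rule described above, not a Slurm tool, and the function name expand_array_spec is invented for this sketch.

```shell
# Expand "start-end:step" into the comma-separated index list it stands for.
# Arguments: start, end, step (all non-negative integers, step >= 1).
expand_array_spec() {
    spec_index=$1
    spec_list=""
    while [ "${spec_index}" -le "$2" ]; do
        spec_list="${spec_list:+${spec_list},}${spec_index}"
        spec_index=$(( spec_index + $3 ))
    done
    printf '%s\n' "${spec_list}"
}

expand_array_spec 0 15 4    # prints 0,4,8,12
```

The "%" limit does not change which indices exist; it only caps how many of them run at the same time.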

To receive mail alerts for each individual array task, --mail-type=ARRAY_TASKS should be added to the Slurm job script. Unless this option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.

Below is an example submit script for submitting job arrays:

#!/bin/bash -l
# sbatch stops parsing at the first line which isn't a comment or whitespace
# SBATCH directives must be at the start of the line -- no indentation

# Name of the job
#SBATCH --job-name=sample_array_job

# Number of cores
#SBATCH --ntasks-per-node=1

# Array jobs.  This example will create 25 jobs, but only allow at most 4 to run concurrently
#SBATCH --array=1-25%4

# Walltime (job duration)
#SBATCH --time=00:15:00

# Email notifications
#SBATCH --mail-type=BEGIN,END,FAIL

# Your commands go here.  Each of the jobs is identical apart from environment variable
# $SLURM_ARRAY_TASK_ID, which will take values in the range 1-25
# They are all independent, and may run on different nodes at different times.
# The $SLURM_ARRAY_TASK_ID variable can be used to construct parameters to programs, select data files etc.
#
# The default output files will contain both the Job ID and the array task ID, and so will be distinct.  If setting
# custom output files, you must be sure that array tasks don't all overwrite the same files.

echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

sleep 300
hostname -s

Each job in the array will be allocated its own resources, possibly on different nodes.
The variable $SLURM_ARRAY_TASK_ID will be different for each task, with values (in this example) 1-25, and can be used to construct arguments to programs to be run as part of the job. One way to use such an array is to create a file with 25 sets of arguments in it, then use $SLURM_ARRAY_TASK_ID as an index into the file. The $(sed …) construct returns a single line from the file.

e.g.

arguments=/path/to/file/with/program/arguments  # 25-line file
# Line N of the file supplies the arguments for array task N
myprogram $(sed -n -e "${SLURM_ARRAY_TASK_ID}p" "$arguments")
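The same idea can be tried outside Slurm with a small self-contained sketch. Here the arguments file is a temporary file with three made-up lines, args_for_task is a hypothetical helper name, and myprogram is the placeholder from the text, so the script prints the command rather than running it.

```shell
# Print line number $2 of file $1 (lines are numbered from 1).
args_for_task() {
    sed -n -e "${2}p" "$1"
}

# Demo file with three argument sets; a real file has one line per array task.
args_file="$(mktemp)"
printf '%s\n' "--alpha 0.1 --seed 1" "--alpha 0.2 --seed 2" "--alpha 0.3 --seed 3" > "${args_file}"

# Inside an array job Slurm sets SLURM_ARRAY_TASK_ID; default to 1 elsewhere.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
task_args="$(args_for_task "${args_file}" "${task_id}")"

if [ -n "${task_args}" ]; then
    echo "Task ${task_id} would run: myprogram ${task_args}"
else
    echo "No arguments found on line ${task_id} of ${args_file}" >&2
fi
```

Since sed numbers lines from 1, this approach fits arrays that start at index 1, as in the example script; an array starting at 0 would need an offset.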

Scheduling Jobs  

The Batch System

  • The batch system used on Discovery is Slurm.
  • Users login, via ssh, to one of the submit nodes and submit jobs to be run on the compute nodes by writing a script file that describes their job.
  • They submit the job to the cluster using the sbatch command.

The primary partitions on the cluster are:

  • standard: This is the main non-GPU queue for the cluster. It does not need to be specified as it is used by default.
  • gpuq: This is a queue set up to run GPU related jobs on the two production GPU nodes. Open to all users.
  • preemptable: A larger set of non-GPU nodes available to all users, but jobs from free account users have lower priority and may be requeued by a high priority account.

Several additional preemptable partitions exist for the newer GPU nodes.

Users specify the amount of time and the number of processors required by their jobs.

Managing and Monitoring your jobs

Some useful commands:

Command    Usage                      Description
sbatch     sbatch <job script>        Submit a batch job to the queue
squeue     squeue                     Show status of Slurm batch jobs
scancel    scancel JOBID              Cancel job
sinfo      sinfo                      Show information about partitions
scontrol   scontrol show job JOBID    Check the status of a running or idle job
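As a brief sketch of how these commands might be combined, the helper below lists only your own jobs, using squeue with its -u (user) option. The function name show_my_jobs is made up for this example, and the check for squeue is there only so the snippet fails gracefully on a machine without Slurm.

```shell
# List the current user's jobs, or explain why that is not possible here.
show_my_jobs() {
    if command -v squeue >/dev/null 2>&1; then
        squeue -u "${USER:-$(id -un)}"
    else
        echo "squeue not found; run this on a Slurm submit node" >&2
        return 1
    fi
}

show_my_jobs || echo "Skipping job listing"
# scancel JOBID    # replace JOBID with a number from the squeue output
```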

The default length of any job submitted to the queue is currently set at one hour and the default maximum number of processors per user is set to a value based on their user status.

Information on Submitting Jobs to the Queue

  • Jobs that run longer than thirty days will be terminated by the scheduler.
  • These parameters are subject to change as we become more familiar with users’ needs.
  • It is important for users to specify the resources required by their jobs. The default is 1 CPU, 8GB memory, on a single node.
  • In the current configuration, the walltime and the number of nodes are the two parameters that matter.
  • If you don’t specify the walltime, the system default of one hour will be assumed and your job may be terminated early.
  • See the Single Core Job Example for further details.
  • Scripts initiated by sbatch will have the environment which exists when you run the sbatch command, unless #!/bin/bash -l is used, in which case the job has the environment of a fresh login to Discovery (recommended).
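Putting the points above together, a script that states its resources explicitly might look like the following sketch. The job name and the two-hour walltime are arbitrary values chosen for illustration, and the 8G memory request simply mirrors the default mentioned above. The variables printed at the end are ones Slurm exports inside a job; the fallbacks apply only when the script is run outside Slurm.

```shell
#!/bin/bash -l
#SBATCH --job-name=explicit_resources
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=8G
#SBATCH --time=02:00:00

# Report what was granted so it appears at the top of the output file.
job_id="${SLURM_JOB_ID:-none}"
node_count="${SLURM_JOB_NUM_NODES:-unknown}"
mem_mb="${SLURM_MEM_PER_NODE:-unknown}"

echo "Job ID:           ${job_id}"
echo "Nodes:            ${node_count}"
echo "Memory per node:  ${mem_mb} MB"
```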

Information for Multiprocessor Jobs  

  • For multiprocessor jobs, it is important to specify the number of nodes and processors required and to select nodes that are of the same architecture.
  • The nodes are divided into partitions. The nodes in each partition are homogeneous with similar chip vendors and speed, as well as disk and memory size.
  • See the Sample parallel job scripts for examples of how to submit parallel jobs. Parallel programs that need to communicate between processes will run more efficiently if all of the processes are in the same group.
  • In general, only MPI jobs know how to utilize multiple nodes. For example:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
  • Before you submit your job, use the sinfo command to see which nodes are currently running jobs, so you can select a partition that has free nodes.
  • Programs using OpenMP or similar multithreading technology can use multiple CPUs for a single task.
  • For these jobs, leave the nodes at the default (1), and use, for example:
#SBATCH --cpus-per-task=4
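For OpenMP programs, requesting the CPUs is usually only half of the setup, because the program also needs to be told how many threads to start. A common pattern, sketched below, is to set OMP_NUM_THREADS from SLURM_CPUS_PER_TASK, which Slurm sets when --cpus-per-task is given. The job name and ./my_openmp_program are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=openmp_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:15:00

# Match the OpenMP thread count to the CPUs Slurm allocated to this task.
# Fall back to a single thread when the script runs outside Slurm.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
# ./my_openmp_program    # hypothetical OpenMP executable
```

Keeping the thread count tied to the allocation means the script stays correct if the --cpus-per-task value is changed later.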