Discovery Tutorials

Discovery tutorials for submitting jobs to the cluster.

  • Sample MPI lab (Hello World)
  • Sample Python lab (walltime)
  • Sample R lab (Hello World)

Sample MPI lab (Hello World)  

In this lab we will use the MPICH implementation of MPI to compile a very basic “Hello World!” program, which we will then submit to run across multiple compute nodes.

Once you have logged into the Discovery cluster, the first thing you will want to do is load the MPI libraries into your path with the MPI module:

$ module load mpi/mpich-x86_64
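If you want to confirm that the module loaded, you can list your currently loaded modules and check that the mpicc compiler wrapper is now on your path (exact output will vary by system):

$ module list
$ which mpicc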

Next we need to create the MPI program. With your favorite editor, create a new file called “sample_mpi_hello_world.c” with the following contents:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  // Get the total number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank (ID) of this process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the host this process is running on
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // Print a hello world message
  printf("Hello world from processor %s, rank %d out of %d processors\n",
         processor_name, world_rank, world_size);

  // Shut down the MPI environment
  MPI_Finalize();
  return 0;
}

Next we compile the program with mpicc:

$ mpicc -o sample_mpi_hello_world sample_mpi_hello_world.c

Once compilation completes, you can test the program by running it across 4 processes like this:

$ mpirun -n 4 ./sample_mpi_hello_world
Hello world from processor discovery7.hpcc.dartmouth.edu, rank 0 out of 4 processors
Hello world from processor discovery7.hpcc.dartmouth.edu, rank 1 out of 4 processors
Hello world from processor discovery7.hpcc.dartmouth.edu, rank 2 out of 4 processors
Hello world from processor discovery7.hpcc.dartmouth.edu, rank 3 out of 4 processors

Now that we have a working MPI program, the next step is to launch it as a batch job. Here is an example batch script which requests multiple CPUs across multiple nodes. We named it “sample_mpi.sh” for this example.

#!/bin/bash -l
# Name of the partition / queue
#SBATCH --partition=standard
# Name of the cluster account (uncomment and set if required)
##SBATCH --account=<account>
# How long should my job run for (wall time limit)
#SBATCH --time=01:00:00
# Number of CPU cores, in this case 4 cores
#SBATCH --ntasks=4
# Number of compute nodes to use, in this case 2
#SBATCH --nodes=2
# Name of the output files to be created. If not specified the outputs will be joined
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err

# The code you want to run in your job
module load mpi/mpich-x86_64
mpirun -n 4 ./sample_mpi_hello_world

Once we have our submit script created, we can submit it to the cluster via sbatch:

$ sbatch sample_mpi.sh
Submitted batch job 3977

To see the job running, type the squeue command and look for the above job ID.
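For example, you can filter squeue by the job ID returned by sbatch, or list only your own jobs (the job ID below is illustrative):

$ squeue -j 3977
$ squeue -u $USER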

When the job runs it will create two output files:

sample_mpi.sh.3977.out
sample_mpi.sh.3977.err

We can see from the output file, “sample_mpi.sh.3977.out”, that the job ran on 4 cores across 2 compute nodes. Note that the ranks may print in any order, since the processes run concurrently.

$ cat sample_mpi.sh.3977.out
Hello world from processor q06.hpcc.dartmouth.edu, rank 1 out of 4 processors
Hello world from processor q06.hpcc.dartmouth.edu, rank 0 out of 4 processors
Hello world from processor q07.hpcc.dartmouth.edu, rank 2 out of 4 processors
Hello world from processor q07.hpcc.dartmouth.edu, rank 3 out of 4 processors
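If you want to double-check which nodes a job was allocated, scontrol can report this while the job is running or shortly after it finishes (the job ID is illustrative):

$ scontrol show job 3977 | grep NodeList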

Sample Python lab (walltime)  

In this lab we will create a simple Python script called invert_matrix.py, which we will submit to the cluster. In addition, we will explore what happens when a job runs out of walltime.

For the purpose of this lab, we will use a conda environment, provided via modules, which has the necessary packages installed. For those who want to use Python outside of this lab, it is strongly encouraged to visit: Using conda to manage your own Python environments (external)

The above KBA will take you through creating a conda environment so that you can manage your own Python packages.
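As a minimal sketch of that workflow (the environment name “mylab” here is illustrative; see the KBA for the recommended steps on Discovery):

$ conda create --name mylab numpy
$ conda activate mylab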

To get started, open a new file in your favorite editor and call it invert_matrix.py. Once created, paste in the following python code.

import numpy as np
import sys

# Invert random matrices of increasing size and measure the error
for i in range(2, 2001):
    x = np.random.rand(i, i)   # random i x i matrix
    y = np.linalg.inv(x)       # its inverse
    z = np.dot(x, y)           # should be close to the identity
    e = np.eye(i)
    r = z - e                  # residual
    m = r.mean()
    if i % 50 == 0:
        print("i,mean", i, m)
        sys.stdout.flush()

Save the file, and test that the program works by issuing:

$ python invert_matrix.py

Next we will want to estimate how long the job will take to complete. One way to get an idea is to run an interactive job, either by sshing directly to a compute node or by requesting a Slurm interactive job via srun. Let's request a Slurm interactive job to estimate how much walltime we will need for our job.

$ srun --account=rc --cpus-per-task=8 --pty /bin/bash
$ module load python
$ export OMP_NUM_THREADS=8
$ time python invert_matrix.py
i,mean 50 -3.163405787045661e-17
i,mean 100 6.142970237536317e-17
i,mean 150 4.5949741833585534e-18
i,mean 200 3.893461968391442e-17
i,mean 250 -3.146382346137727e-17
i,mean 300 7.28979443833719e-17
i,mean 350 -2.1393282891571857e-17
i,mean 400 -1.0057995742663563e-16
i,mean 450 3.739472179051286e-17
i,mean 500 1.0962310076481413e-17
...
i,mean 2000 2.71473510528421e-16

real 4m36.904s
user 4m27.038s
sys 0m8.851s

Above you can see that I am using srun to create an interactive session and asking for 8 cores on a compute node. Once the command executes, I am placed on a new node, p04.

For this example, I will be using the time command. More information about the time command can be found by issuing “man time”.

After the interactive session has started, I step through the steps I would normally take to run my code, but I place the time command in front of the python command before pressing enter:

$ time python invert_matrix.py

At the end, time reports how long the job took to complete. In the above example, you can see that it took 4 minutes and 36 seconds. I now know that the walltime I request should be at least 5 minutes.

Now let's try a batch example, and let's not give it enough walltime, to see what happens.

Next we will create a batch script to submit to the cluster. Copy the example below into a new file, named sample_python_lab.sh, in the same directory as your invert_matrix.py file.

#!/bin/bash -l
# Name of the cluster account (uncomment and set if required)
##SBATCH --account=<account>
# How long should my job run for (wall time limit)
#SBATCH --time=00:02:00
# Number of CPU cores, in this case 8 cores
#SBATCH --ntasks-per-node=8
# Number of compute nodes to use, in this case 1
#SBATCH --nodes=1
# Name of the output files to be created. If not specified the outputs will be joined
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
# The code you want to run in your job
module load python
export OMP_NUM_THREADS=8
python invert_matrix.py

Once you have your submit script written, submit the job via the sbatch command.

$ sbatch sample_python_lab.sh
Submitted batch job 4156

Upon submitting you receive a job number. In the example above my job number is 4156.

As soon as the job starts, you will notice two output files created within the directory. In our case they are:

sample_python_lab.sh.4156.err
sample_python_lab.sh.4156.out

$ ls -l
total 136
-rw-r--r-- 1  rc-users  218 May 14 16:20 invert_matrix.py
-rw-r--r-- 1  rc-users  568 May 14 18:44 sample_python_lab.sh
-rw-r--r-- 1  rc-users  107 May 14 18:44 sample_python_lab.sh.4156.err
-rw-r--r-- 1  rc-users 1309 May 14 18:46 sample_python_lab.sh.4156.out
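While the job is still running, you can follow the output file as it is written (the file name will match your own job ID):

$ tail -f sample_python_lab.sh.4156.out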

Now that we see the .err and .out files, go ahead and take a look at the .out file using the cat command:

$ cat sample_python_lab.sh.4156.out
i,mean 50 -9.642611316148498e-17
i,mean 100 1.87926695251823e-16
i,mean 150 1.8993939654816242e-16
i,mean 200 2.946282755137268e-17
...
i,mean 1950 -1.7521916145004967e-17

It looks like the job did not run the code to completion. If it had, we would have expected to see the last line:

i,mean 2000 2.71473510528421e-16

Next, take a look at the error file to see if it has any clues as to why our job aborted.

$ cat sample_python_lab.sh.4156.err
slurmstepd: error: *** JOB 4156 ON p04 CANCELLED AT 2025-05-14T18:46:49 DUE TO TIME LIMIT ***

After looking at the .err file, it is clear that our job ran out of walltime, as indicated by the message above. In this case we requested 2 minutes of walltime but should have requested at least 5.
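You can also confirm this from the accounting records. sacct reports the job's state, elapsed time, and time limit (the job ID is illustrative; available fields depend on the cluster's accounting configuration):

$ sacct -j 4156 --format=JobID,JobName,State,Elapsed,Timelimit

To fix the failure, raise the limit in sample_python_lab.sh, for example to #SBATCH --time=00:05:00, and resubmit.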

Sample R lab (Hello World)  

In this lab we will create a basic R script to print “Hello World!”. Then we will use the scheduler to submit the job via sbatch.

The first step of this process is either to move your existing R script and data to the cluster, or to simply start by creating a file and adding R code to it.

Go ahead and create a new file with your favorite editor and name it sample.R

When you have created the R file, add the following lines:

# A simple R script to print hello world!
aString = "Hello World!"

print (aString)

We can test that the script works by issuing Rscript to execute the script:

$ Rscript sample.R
[1] "Hello World!"

Great, now we know that our R script works. Next, let's submit it as a job to the scheduler. In order to do so, we need to create a script which tells the scheduler what we want it to do. Below is an example script that you may use. For the purpose of this lab I have named the example script sample_R.sh

#!/bin/bash -l
# How long should my job run for, wall time limit: --time=<hh:mm:ss>
#SBATCH --time=01:00:00
# Number of CPU cores, in this case 1 core
#SBATCH --ntasks=1
# Number of compute nodes to use
#SBATCH --nodes=1
# Name of the output files to be created. If not specified the outputs will be joined
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
# The code you want to run in your job
Rscript sample.R

Once your job script (the script you use to submit to the cluster) and your code (the sample.R script) have been created, you can submit the job to the cluster with the sbatch command:

$ sbatch sample_R.sh
Submitted batch job 3957

The returned value shows a successful job submission, followed by a job ID. Once the job completes, there should be two new files created in the directory.

$ ls | grep 3957
sample_R.sh.3957.err
sample_R.sh.3957.out

To verify the job executed my R code, I should see the print statement of “Hello World!” inside the output file:

$ cat sample_R.sh.3957.out
[1] "Hello World!"```