Both the srun and sbatch commands have similar capabilities and sets of parameters. srun is used to submit a job for execution in real time and blocks the terminal: you will not be able to issue other commands while the program executes, and you must keep your connection open while the program runs. It is generally recommended to use srun for quick test runs and simple workflows.
Running a simple program
To queue a quick test program, use:
srun -n4 -N1 <program run command>
Here we have specified one node (-N1) and four tasks (cores) (-n4), and used the defaults for the other parameters.
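For instance, a quick sanity check might run the standard hostname command in place of your own program:

srun -n4 -N1 hostname

Each of the four tasks prints the name of the allocated node, confirming that the allocation works.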
Interactive Session
A common use case for srun is to open an interactive shell session on a cluster.
To start an interactive session, run:
srun -n1 --partition=<partition_name> --pty bash
This gives you direct access to a single node in the partition specified by partition_name once it is available.
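For example, to open an interactive shell on a partition named debug (the partition name here is only a placeholder; use one that exists on the cluster):

srun -n1 --partition=debug --pty bash

Type exit when you are done to release the node back to the scheduler.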
For more information on partitions, click here. For a list of all srun parameters, click here.
Using sbatch
The typical way to run a job on a cluster is to write a submission script and run the sbatch command. This allows for greater control and parameter passing. sbatch jobs are also held in the queue awaiting execution, so the user can log out without affecting the submitted job.
Writing a run script
*** Note: Take a look at example_run_script.sh for a complete run script ***
The very first line of the script must be the shebang, such as:
#!/bin/bash
Directives
We can include sbatch parameter directives after the shebang to specify the parameters we wish to pass to SLURM. sbatch accepts the same parameters as srun. For a list of all sbatch parameters, click here.
Some example run script directives would be,
#SBATCH --job-name=name_of_job
#SBATCH --output=output_file_name.txt
#SBATCH --error=error_file_name.log
#SBATCH --ntasks=1
The above directives specify the job name, the standard output and standard error file names, and the number of tasks. If output and error files are not specified, both streams are written to slurm-<job_id>.out by default.
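Other commonly used directives include a wall-clock time limit and a partition; the values below are illustrative only:

#SBATCH --time=01:00:00
#SBATCH --partition=partition_name

--time sets the maximum run time (here one hour), after which SLURM terminates the job.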
Shell Commands
Finally, we append the shell command that instructs SLURM to run the desired program.
program_run_command program
For example, to run a Python script we would write:
python python_Script.py
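Putting the pieces together, a minimal serial run script (the file and job names are placeholders) might look like:

#!/bin/bash
#SBATCH --job-name=name_of_job
#SBATCH --output=output_file_name.txt
#SBATCH --error=error_file_name.log
#SBATCH --ntasks=1

python python_Script.py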
Parallel Programs
To run an MPI program, we would write:
module load OpenMPI
srun MPI_program.mpi
It is recommended to use srun inside the sbatch run script, as srun automatically launches the required processes.
module load OpenMPI loads the MPI runtime library prior to running the program. For more information on modules, click here.
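As a sketch, a complete MPI run script built from the pieces above (the job name, output file, and task count are illustrative) could look like:

#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --output=mpi_output.txt
#SBATCH --ntasks=4

module load OpenMPI
srun MPI_program.mpi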
Submitting the run script
To submit a job with a run script, use:
sbatch --partition=partition_name run_script.sh
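sbatch prints the assigned job ID on submission. You can then check the state of your jobs with the standard SLURM command:

squeue -u $USER

and cancel a job with scancel <job_id> if necessary.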
Customizing resource allocation
One of the important advantages of using Spiedie is the flexibility of hardware resources available to the user. You can tailor your resource request to suit the needs of your program.
Proper resource allocation also ensures your program runs as fast and efficiently as possible and does not halt unexpectedly due to hardware resource shortages, such as running out of memory. It is also best practice to request only the resources you need so that SLURM can schedule the entire cluster as efficiently as possible.
Using Spiedie-specific Directives or Features
One way to make sure your programs run properly is to run them on the correct partition, as stated above.
You can also use the feature flag to help properly allocate resources. For more information on partitions, click here.
To make use of a feature, such as the KNL nodes, use:
srun -N1 -n40 -C "knl" ./program_to_run
The above command requests 40 cores, which are only available on the Knights Landing nodes. The -C (constraint) flag ensures that the program only runs on KNL nodes.
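The same constraint can be requested from a run script with a directive; --constraint is the long form of the -C flag used above:

#SBATCH --constraint=knl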
Increasing memory allocation
In order to increase the default memory allocation (2 GB), you can use the --mem flag for srun and sbatch to specify the memory needed per node.
For example:
srun -n1 --mem=4G ./program_to_run
The above job requests 4 GB of memory on the single (default) node.
You can also request memory per core using the --mem-per-cpu flag.
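For example, to request 1 GB per core for a four-task job (the values are illustrative):

srun -n4 --mem-per-cpu=1G ./program_to_run

This allocates 4 GB in total, 1 GB for each of the four tasks.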