<!--#include virtual="header.txt"-->
<h1>Quick Start User Guide</h1>
<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
<p>Slurm is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling system
for large and small Linux clusters. Slurm requires no kernel modifications for
its operation and is relatively self-contained. As a cluster workload manager,
Slurm has three key functions. First, it allocates exclusive and/or non-exclusive
access to resources (compute nodes) to users for some duration of time so they
can perform work. Second, it provides a framework for starting, executing, and
monitoring work (normally a parallel job) on the set of allocated nodes. Finally,
it arbitrates contention for resources by managing a queue of pending work.</p>
<h2 id="arch">Architecture<a class="slurm_link" href="#arch"></a></h2>
<p>As depicted in Figure 1, Slurm consists of a <b>slurmd</b> daemon running on
each compute node and a central <b>slurmctld</b> daemon running on a management node
(with optional fail-over twin).
The <b>slurmd</b> daemons provide fault-tolerant hierarchical communications.
The user commands include: <b>sacct</b>, <b>sacctmgr</b>, <b>salloc</b>,
<b>sattach</b>, <b>sbatch</b>, <b>sbcast</b>, <b>scancel</b>, <b>scontrol</b>,
<b>scrontab</b>, <b>sdiag</b>, <b>sh5util</b>, <b>sinfo</b>, <b>sprio</b>,
<b>squeue</b>, <b>sreport</b>, <b>srun</b>, <b>sshare</b>, <b>sstat</b>,
<b>strigger</b> and <b>sview</b>.
All of the commands can run anywhere in the cluster.</p>
<div class="figure">
<img src="arch.gif" width=550><br>
Figure 1. Slurm components
</div>
<p>The entities managed by these Slurm daemons, shown in Figure 2, include
<b>nodes</b>, the compute resource in Slurm;
<b>partitions</b>, which group nodes into logical (possibly overlapping) sets;
<b>jobs</b>, or allocations of resources assigned to a user for
a specified amount of time; and
<b>job steps</b>, which are sets of (possibly parallel) tasks within a job.
The partitions can be considered job queues, each of which has an assortment of
constraints such as job size limit, job time limit, users permitted to use it, etc.
Priority-ordered jobs are allocated nodes within a partition until the resources
(nodes, processors, memory, etc.) within that partition are exhausted. Once
a job is assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For instance,
a single job step may be started that utilizes all nodes allocated to the job,
or several job steps may independently use a portion of the allocation.</p>
<div class="figure">
<img src="entities.gif" width=550><br>
Figure 2. Slurm entities
</div>
<h2 id="commands">Commands<a class="slurm_link" href="#commands"></a></h2>
<p>Man pages exist for all Slurm daemons, commands, and API functions. The command
option <span class="commandline">--help</span> also provides a brief summary of
options. Note that the command options are all case sensitive.</p>
<p><span class="commandline"><b>sacct</b></span> is used to report job or job
step accounting information about active or completed jobs.</p>
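<p>For example, the following would report selected accounting fields for a
completed job (the job ID is illustrative, and the fields available depend on
the accounting storage configuration):</p>
<pre>
adev0: sacct -j 65646 --format=JobID,JobName,Partition,State,Elapsed
</pre>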
<p><span class="commandline"><b>salloc</b></span> is used to allocate resources
for a job in real time. Typically this is used to allocate resources and spawn a shell.
The shell is then used to execute srun commands to launch parallel tasks.</p>
<p><span class="commandline"><b>sattach</b></span> is used to attach standard
input, output, and error plus signal capabilities to a currently running
job or job step. One can attach to and detach from jobs multiple times.</p>
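<p>For example, one could attach to the first job step of a running job as
shown below (the job and step IDs are illustrative):</p>
<pre>
adev0: sattach 65646.0
</pre>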
<p><span class="commandline"><b>sbatch</b></span> is used to submit a job script
for later execution. The script will typically contain one or more srun commands
to launch parallel tasks.</p>
<p><span class="commandline"><b>sbcast</b></span> is used to transfer a file
from local disk to local disk on the nodes allocated to a job. This can be
used to effectively use diskless compute nodes or provide improved performance
relative to a shared file system.</p>
<p><span class="commandline"><b>scancel</b></span> is used to cancel a pending
or running job or job step. It can also be used to send an arbitrary signal to
all processes associated with a running job or job step.</p>
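<p>For example, one might send SIGUSR1 to all processes of a running job
without cancelling it, then cancel a single job step (the IDs shown are
illustrative):</p>
<pre>
adev0: scancel --signal=USR1 65646
adev0: scancel 65646.0
</pre>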
<p><span class="commandline"><b>scontrol</b></span> is the administrative tool
used to view and/or modify Slurm state. Note that many <span class="commandline">scontrol</span>
commands can only be executed as user root.</p>
<p><span class="commandline"><b>sinfo</b></span> reports the state of partitions
and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting
options.</p>
<p><span class="commandline"><b>sprio</b></span> is used to display a detailed
view of the components affecting a job's priority.</p>
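<p>For example, to show the priority components of a single pending job
(the job ID is illustrative):</p>
<pre>
adev0: sprio -j 65648
</pre>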
<p><span class="commandline"><b>squeue</b></span> reports the state of jobs or
job steps. It has a wide variety of filtering, sorting, and formatting options.
By default, it reports the running jobs in priority order and then the pending
jobs in priority order.</p>
<p><span class="commandline"><b>srun</b></span> is used to submit a job for
execution or initiate job steps in real time.
<span class="commandline">srun</span>
has a wide variety of options to specify resource requirements, including: minimum
and maximum node count, processor count, specific nodes to use or not use, and
specific node characteristics (e.g. minimum memory, disk space, or required
features).
A job can contain multiple job steps executing sequentially or in parallel on
independent or shared resources within the job's node allocation.</p>
<p><span class="commandline"><b>sshare</b></span> displays detailed information
about fairshare usage on the cluster. Note that this is only viable when using
the priority/multifactor plugin.</p>
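<p>For example, to report fairshare information for all users with accounts
on the cluster:</p>
<pre>
adev0: sshare -a
</pre>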
<p><span class="commandline"><b>sstat</b></span> is used to get information
about the resources utilized by a running job or job step.</p>
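<p>For example, the following would report CPU and memory usage for a running
job step (the step ID is illustrative):</p>
<pre>
adev0: sstat -j 65646.0 --format=JobID,AveCPU,MaxRSS,MaxRSSNode
</pre>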
<p><span class="commandline"><b>strigger</b></span> is used to set, get or
view event triggers. Event triggers include things such as nodes going down
or jobs approaching their time limit.</p>
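<p>For example, a site might register a program to run whenever a node goes
down, then list the active triggers (the program path here is hypothetical):</p>
<pre>
adev0: strigger --set --node --down --program=/usr/sbin/notify_admin
adev0: strigger --get
</pre>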
<p><span class="commandline"><b>sview</b></span> is a graphical user interface to
get and update state information for jobs, partitions, and nodes managed by Slurm.</p>
<h2 id="examples">Examples<a class="slurm_link" href="#examples"></a></h2>
<p>First we determine what partitions exist on the system, what nodes
they include, and general system state. This information is provided
by the <span class="commandline">sinfo</span> command.
In the example below we find there are two partitions: <i>debug</i>
and <i>batch</i>.
The <i>*</i> following the name <i>debug</i> indicates this is the
default partition for submitted jobs.
We see that both partitions are in an <i>UP</i> state.
Some configurations may include partitions for larger jobs
that are <i>DOWN</i> except on weekends or at night. The information
about each partition may be split over more than one line so that
nodes in different states can be identified.
In this case, the two nodes <i>adev[1-2]</i> are <i>down</i>.
The <i>*</i> following the state <i>down</i> indicates the nodes are
not responding. Note the concise expression used for node names:
a common prefix (<i>adev</i>) followed by numeric ranges or
specific numbers. This format allows very large clusters to be
easily managed.
The <span class="commandline">sinfo</span> command
has many options to easily let you view the information of interest
to you in whatever format you prefer.
See the man page for more information.</p>
<pre>
adev0: sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up 30:00 2 down* adev[1-2]
debug* up 30:00 3 idle adev[3-5]
batch up 30:00 3 down* adev[6,13,15]
batch up 30:00 3 alloc adev[7-8,14]
batch up 30:00 4 idle adev[9-12]
</pre>
<p>Next we determine what jobs exist on the system using the
<span class="commandline">squeue</span> command. The
<i>ST</i> field is job state.
Two jobs are in a running state (<i>R</i> is an abbreviation
for <i>Running</i>) while one job is in a pending state
(<i>PD</i> is an abbreviation for <i>Pending</i>).
The <i>TIME</i> field shows how long the jobs have been running,
using the format <i>days-hours:minutes:seconds</i>.
The <i>NODELIST(REASON)</i> field indicates where the
job is running or the reason it is still pending. Typical
reasons for pending jobs are <i>Resources</i> (waiting
for resources to become available) and <i>Priority</i>
(queued behind a higher priority job).
The <span class="commandline">squeue</span> command
has many options to easily let you view the information of interest
to you in whatever format you prefer.
See the man page for more information.</p>
<pre>
adev0: squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
65646 batch chem mike R 24:19 2 adev[7-8]
65647 batch bio joan R 0:09 1 adev14
65648 batch math phil PD 0:00 6 (Resources)
</pre>
<p>The <span class="commandline">scontrol</span> command
can be used to report more detailed information about
nodes, partitions, jobs, job steps, and configuration.
It can also be used by system administrators to make
configuration changes. A couple of examples are shown
below. See the man page for more information.</p>
<pre>
adev0: scontrol show partition
PartitionName=debug TotalNodes=5 TotalCPUs=40 RootOnly=NO
Default=YES OverSubscribe=FORCE:4 PriorityTier=1 State=UP
MaxTime=00:30:00 Hidden=NO
MinNodes=1 MaxNodes=26 DisableRootJobs=NO AllowGroups=ALL
Nodes=adev[1-5] NodeIndices=0-4
PartitionName=batch TotalNodes=10 TotalCPUs=80 RootOnly=NO
Default=NO OverSubscribe=FORCE:4 PriorityTier=1 State=UP
MaxTime=16:00:00 Hidden=NO
MinNodes=1 MaxNodes=26 DisableRootJobs=NO AllowGroups=ALL
Nodes=adev[6-15] NodeIndices=5-14
adev0: scontrol show node adev1
NodeName=adev1 State=DOWN* CPUs=8 AllocCPUs=0
RealMemory=4000 TmpDisk=0
Sockets=2 Cores=4 Threads=1 Weight=1 Features=intel
Reason=Not responding [slurm@06/02-14:01:24]
adev0: scontrol show job
JobId=65672 UserId=phil(5136) GroupId=phil(5136)
Name=math
Priority=4294901603 Partition=batch BatchFlag=1
AllocNode:Sid=adev0:16726 TimeLimit=00:10:00 ExitCode=0:0
StartTime=06/02-15:27:11 EndTime=06/02-15:37:11
JobState=PENDING NodeList=(null) NodeListIndices=
NumCPUs=24 ReqNodes=1 ReqS:C:T=1-65535:1-65535:1-65535
OverSubscribe=1 Contiguous=0 CPUs/task=0 Licenses=(null)
MinCPUs=1 MinSockets=1 MinCores=1 MinThreads=1
MinMemory=0 MinTmpDisk=0 Features=(null)
Dependency=(null) Account=(null) Requeue=1
Reason=None Network=(null)
ReqNodeList=(null) ReqNodeListIndices=
ExcNodeList=(null) ExcNodeListIndices=
SubmitTime=06/02-15:27:11 SuspendTime=None PreSusTime=0
Command=/home/phil/math
WorkDir=/home/phil
</pre>
<p>It is possible to create a resource allocation and launch
the tasks for a job step in a single command line using the
<span class="commandline">srun</span> command. Depending
upon the MPI implementation used, MPI jobs may also be
launched in this manner.
See the <a href="#mpi">MPI</a> section for more MPI-specific information.
In this example we execute <span class="commandline">/bin/hostname</span>
on three nodes (<i>-N3</i>) and include task numbers on the output (<i>-l</i>).
The default partition will be used.
One task per node will be used by default.
Note that the <span class="commandline">srun</span> command has
many options available to control what resources are allocated
and how tasks are distributed across those resources.</p>
<pre>
adev0: srun -N3 -l /bin/hostname
0: adev3
1: adev4
2: adev5
</pre>
<p>This variation on the previous example executes
<span class="commandline">/bin/hostname</span> in four tasks (<i>-n4</i>).
One processor per task will be used by default (note that we don't specify
a node count).</p>
<pre>
adev0: srun -n4 -l /bin/hostname
0: adev3
1: adev3
2: adev3
3: adev3
</pre>
<p>One common mode of operation is to submit a script for later execution.
In this example the script name is <i>my.script</i> and we explicitly use
the nodes adev9 and adev10 (<i>-w "adev[9-10]"</i>, note the use of a
node range expression).
We also explicitly state that the subsequent job steps will spawn four tasks
each, which will ensure that our allocation contains at least four processors
(one processor per task to be launched).
The output will appear in the file my.stdout (<i>-o my.stdout</i>).
The script itself contains an embedded time limit for the job.
Other options can be supplied as desired by using a prefix of "#SBATCH" followed
by the option at the beginning of the script (before any commands to be executed
in the script).
Options supplied on the command line would override any options specified within
the script.
Note that my.script contains the command <span class="commandline">/bin/hostname</span>,
which is executed on the first node in the allocation (where the script runs), plus
two job steps initiated using the <span class="commandline">srun</span> command
and executed sequentially.</p>
<pre>
adev0: cat my.script
#!/bin/sh
#SBATCH --time=1
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd
adev0: sbatch -n4 -w &quot;adev[9-10]&quot; -o my.stdout my.script
sbatch: Submitted batch job 469
adev0: cat my.stdout
adev9
0: adev9
1: adev9
2: adev10
3: adev10
0: /home/jette
1: /home/jette
2: /home/jette
3: /home/jette
</pre>
<p>The final mode of operation is to create a resource allocation
and spawn job steps within that allocation.
The <span class="commandline">salloc</span> command is used
to create a resource allocation and typically start a shell within
that allocation.
One or more job steps would typically be executed within that allocation
using the <span class="commandline">srun</span> command to launch the tasks
(depending upon the type of MPI being used, the launch mechanism may
differ, see <a href="#mpi">MPI</a> details below).
Finally the shell created by <span class="commandline">salloc</span> would
be terminated using the <i>exit</i> command.
Slurm does not automatically migrate executable or data files
to the nodes allocated to a job.
The files must either exist on local disk or in some global file system
(e.g. NFS or Lustre).
We provide the tool <span class="commandline">sbcast</span> to transfer
files to local storage on allocated nodes using Slurm's hierarchical
communications.
In this example we use <span class="commandline">sbcast</span> to transfer
the executable program <i>a.out</i> to <i>/tmp/joe.a.out</i> on local storage
of the allocated nodes.
After executing the program, we delete it from local storage.</p>
<pre>
tux0: salloc -N1024 bash
salloc: Granted job allocation 471
$ sbcast a.out /tmp/joe.a.out
$ srun /tmp/joe.a.out
Result is 3.14159
$ srun rm /tmp/joe.a.out
$ exit
salloc: Relinquishing job allocation 471
</pre>
<p>In this example, we submit a batch job, get its status, and cancel it. </p>
<pre>
adev0: sbatch test
sbatch: Submitted batch job 473
adev0: squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
473 batch test jill R 00:00 1 adev9
adev0: scancel 473
adev0: squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
</pre>
<h2 id="best">Best Practices, Large Job Counts
<a class="slurm_link" href="#best"></a>
</h2>
<p>Consider putting related work into a single Slurm job with multiple job
steps both for performance reasons and ease of management.
Each Slurm job can contain a multitude of job steps and the overhead in
Slurm for managing job steps is much lower than that of individual jobs.</p>
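<p>As a sketch of this approach, the batch script below (which assumes a
hypothetical <i>./my_task</i> program) runs many short tasks as job steps
within a single job rather than submitting each task as a separate job;
<i>srun --exclusive</i> dedicates resources to each step so that several
steps can run concurrently within the allocation:</p>
<pre>
#!/bin/bash
#SBATCH -n 16
# Run 64 short tasks as job steps within one job allocation.
# Up to 16 single-task steps run at a time; the rest wait for free resources.
for i in $(seq 1 64); do
    srun -n1 -N1 --exclusive ./my_task $i &amp;
done
wait
</pre>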
<p><a href="job_array.html">Job arrays</a> are an efficient mechanism of
managing a collection of batch jobs with identical resource requirements.
Most Slurm commands can manage job arrays either as individual elements (tasks)
or as a single entity (e.g. delete an entire job array in a single command).</p>
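<p>For example, a script like <i>my.script</i> above could be submitted as a
32-element array and its elements managed individually or as a whole (the job
ID 470 and output file name are illustrative; <i>%A</i> expands to the array's
master job ID and <i>%a</i> to the task index):</p>
<pre>
adev0: sbatch --array=0-31 -o my.stdout.%A_%a my.script
adev0: squeue -j 470
adev0: scancel 470_7
adev0: scancel 470
</pre>
<p>Here <i>scancel 470_7</i> removes only array task 7, while
<i>scancel 470</i> cancels the entire array.</p>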
<h2 id="mpi">MPI<a class="slurm_link" href="#mpi"></a></h2>
<p>MPI use depends upon the MPI implementation being used.
These implementations rely on three fundamentally different modes of
operation.</p>
<ol>
<li>Slurm directly launches the tasks and performs initialization of
communications through the PMI2 or PMIx APIs. (Supported by most
modern MPI implementations; see the example after this list.)</li>
<li>Slurm creates a resource allocation for the job and then
mpirun launches tasks using Slurm's infrastructure (older versions of
OpenMPI).</li>
<li>Slurm creates a resource allocation for the job and then
mpirun launches tasks using some mechanism other than Slurm,
such as SSH or RSH.
These tasks are initiated outside of Slurm's monitoring
or control. Slurm's epilog should be configured to purge
these tasks when the job's allocation is relinquished. The
use of pam_slurm_adopt is also strongly recommended.</li>
</ol>
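<p>As a sketch of the first mode, an MPI program built against a library with
PMIx (or PMI2) support could be launched directly by
<span class="commandline">srun</span>; the <i>--mpi</i> value must name a
plugin configured on the system, and the program name here is illustrative:</p>
<pre>
adev0: srun -n64 --mpi=pmix ./a.out
</pre>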
<p>Links to instructions for using several varieties of MPI
with Slurm are provided below.</p>
<ul>
<li><a href="mpi_guide.html#intel_mpi">Intel MPI</a></li>
<li><a href="mpi_guide.html#mpich2">MPICH2</a></li>
<li><a href="mpi_guide.html#mvapich2">MVAPICH2</a></li>
<li><a href="mpi_guide.html#open_mpi">Open MPI</a></li>
</ul>
<p style="text-align:center;">Last modified 29 June 2021</p>
<!--#include virtual="footer.txt"-->