<!--#include virtual="header.txt"-->
<h1>Quick Start User Guide</h1>
<h2>Overview</h2>
<p>The Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling system
for large and small Linux clusters. SLURM requires no kernel modifications for
its operation and is relatively self-contained. As a cluster resource manager,
SLURM has three key functions. First, it allocates exclusive and/or non-exclusive
access to resources (compute nodes) to users for some duration of time so they
can perform work. Second, it provides a framework for starting, executing, and
monitoring work (normally a parallel job) on the set of allocated nodes. Finally,
it arbitrates conflicting requests for resources by managing a queue of pending
work.</p>
<h2>Architecture</h2>
<p>As depicted in Figure 1, SLURM consists of a <b>slurmd</b> daemon running on
each compute node and a central <b>slurmctld</b> daemon running on a management node
(with optional fail-over twin).
The <b>slurmd</b> daemons provide fault-tolerant hierarchical communications.
The user commands include: <b>salloc</b>, <b>sattach</b>, <b>sbatch</b>,
<b>sbcast</b>, <b>scancel</b>, <b>sinfo</b>, <b>srun</b>,
<b>smap</b>, <b>squeue</b>, and <b>scontrol</b>.
All of the commands can run anywhere in the cluster.</p>
<div class="figure">
<img src="arch.gif" width="600"><br />
Figure 1. SLURM components
</div>
<p>The entities managed by these SLURM daemons, shown in Figure 2, include <b>nodes</b>,
the compute resource in SLURM, <b>partitions</b>, which group nodes into logical
sets, <b>jobs</b>, or allocations of resources assigned to a user for
a specified amount of time, and <b>job steps</b>, which are sets of (possibly
parallel) tasks within a job.
The partitions can be considered job queues, each of which has an assortment of
constraints such as job size limit, job time limit, users permitted to use it, etc.
Priority-ordered jobs are allocated nodes within a partition until the resources
(nodes, processors, memory, etc.) within that partition are exhausted. Once
a job is assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For instance,
a single job step may be started that utilizes all nodes allocated to the job,
or several job steps may independently use a portion of the allocation.</p>
<div class="figure">
<img src="entities.gif" width="291" height="218"><br />
Figure 2. SLURM entities
</div>
<p class="footer"><a href="#top">top</a></p>
<h2>Commands</h2>
<p>Man pages exist for all SLURM daemons, commands, and API functions. The command
option <span class="commandline">--help</span> also provides a brief summary of
options. Note that the command options are all case sensitive.</p>
<p><span class="commandline"><b>salloc</b></span> is used to allocate resources
for a job in real time. Typically this is used to allocate resources and spawn a shell.
The shell is then used to execute srun commands to launch parallel tasks.</p>
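<p>For example, a minimal interactive session might look like the following
(the node count and program are illustrative):</p>
<pre>
$ salloc -N2 sh      # allocate two nodes and spawn a shell for the job
&gt; srun -l hostname   # launch a job step on the allocated nodes
&gt; exit               # exit the shell, releasing the allocation
</pre>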
<p><span class="commandline"><b>sattach</b></span> is used to attach standard
input, output, and error plus signal capabilities to a currently running
job or job step. One can attach to and detach from jobs multiple times.</p>
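<p>For example, assuming a job 15 with a running job step 0 (the IDs are
illustrative):</p>
<pre>
$ sattach 15.0       # attach to job step 0 of job 15
</pre>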
<p><span class="commandline"><b>sbatch</b></span> is used to submit a job script
for later execution. The script will typically contain one or more srun commands
to launch parallel tasks.</p>
<p><span class="commandline"><b>sbcast</b></span> is used to transfer a file
from local disk to local disk on the nodes allocated to a job. This can be
useful for diskless compute nodes or to provide improved performance
relative to a shared file system.</p>
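<p>For example, a batch script might copy a program to local storage on each
allocated node before launching it (the file names are illustrative):</p>
<pre>
#!/bin/sh
sbcast my.prog /tmp/my.prog   # copy my.prog to /tmp on every allocated node
srun /tmp/my.prog             # launch the local copies as a job step
</pre>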
<p><span class="commandline"><b>scancel</b></span> is used to cancel a pending
or running job or job step. It can also be used to send an arbitrary signal to
all processes associated with a running job or job step.</p>
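<p>For example (the job IDs and signal are illustrative):</p>
<pre>
$ scancel 1234                 # cancel job 1234
$ scancel --signal=USR1 1235   # send SIGUSR1 to all processes of job 1235
</pre>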
<p><span class="commandline"><b>scontrol</b></span> is the administrative tool
used to view and/or modify SLURM state. Note that many <span class="commandline">scontrol</span>
commands can only be executed as user root.</p>
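<p>For example (the job ID, node name, and reason are illustrative; the update
command requires root privileges):</p>
<pre>
$ scontrol show partition debug
$ scontrol show job 1234
$ scontrol update NodeName=adev9 State=DRAIN Reason=&quot;maintenance&quot;
</pre>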
<p><span class="commandline"><b>sinfo</b></span> reports the state of partitions
and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting
options.</p>
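<p>For example, to report only idle nodes in a given partition, or to report
node-oriented rather than partition-oriented information (the partition name
is illustrative):</p>
<pre>
$ sinfo --partition=debug --states=idle
$ sinfo --Node
</pre>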
<p><span class="commandline"><b>squeue</b></span> reports the state of jobs or
job steps. It has a wide variety of filtering, sorting, and formatting options.
By default, it reports the running jobs in priority order and then the pending
jobs in priority order.</p>
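<p>For example, to report one user's jobs or only the pending jobs in a given
partition (the user and partition names are illustrative):</p>
<pre>
$ squeue --user=jette
$ squeue --states=PENDING --partition=batch
</pre>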
<p><span class="commandline"><b>srun</b></span> is used to submit a job for
execution or initiate job steps in real time.
<span class="commandline">srun</span>
has a wide variety of options to specify resource requirements, including: minimum
and maximum node count, processor count, specific nodes to use or not use, and
specific node characteristics (a minimum amount of memory or disk space, certain
required features, etc.).
A job can contain multiple job steps executing sequentially or in parallel on
independent or shared nodes within the job's node allocation.</p>
<p><span class="commandline"><b>smap</b></span> reports state information for
jobs, partitions, and nodes managed by SLURM, but graphically displays the
information to reflect network topology.</p>
<p><span class="commandline"><b>sview</b></span> is a graphical user interface to
get and update state information for jobs, partitions, and nodes managed by SLURM.</p>
<p class="footer"><a href="#top">top</a></p>
<h2>Examples</h2>
<p>Execute <span class="commandline">/bin/hostname</span> on four nodes (<span class="commandline">-N4</span>).
Include task numbers on the output (<span class="commandline">-l</span>). The
default partition will be used. One task per node will be used by default. </p>
<pre>
adev0: srun -N4 -l /bin/hostname
0: adev9
1: adev10
2: adev11
3: adev12
</pre> <p>Execute <span class="commandline">/bin/hostname</span> in four
tasks (<span class="commandline">-n4</span>). Include task numbers on the output
(<span class="commandline">-l</span>). The default partition will be used. One
processor per task will be used by default (note that we don't specify a node
count).</p>
<pre>
adev0: srun -n4 -l /bin/hostname
0: adev9
1: adev9
2: adev10
3: adev10
</pre> <p>Submit the script my.script for later execution.
Explicitly use the nodes adev9 and adev10 ("-w adev[9-10]", note the use of a
node range expression).
We also explicitly state that the subsequent job steps will spawn four tasks
each, which will ensure that our allocation contains at least four processors
(one processor per task to be launched).
The output will appear in the file my.stdout ("-o my.stdout").
This script contains a time limit for the job embedded within itself.
Other options can be supplied as desired by using a prefix of "#SBATCH" followed
by the option at the beginning of the script (before any commands to be executed
in the script).
Options supplied on the command line would override any options specified within
the script.
Note that my.script contains the command <span class="commandline">/bin/hostname</span>
that is executed on the first node in the allocation (where the script runs) plus
two job steps initiated using the <span class="commandline">srun</span> command
and executed sequentially.</p>
<pre>
adev0: cat my.script
#!/bin/sh
#SBATCH --time=1
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd
adev0: sbatch -n4 -w &quot;adev[9-10]&quot; -o my.stdout my.script
sbatch: Submitted batch job 469
adev0: cat my.stdout
adev9
0: adev9
1: adev9
2: adev10
3: adev10
0: /home/jette
1: /home/jette
2: /home/jette
3: /home/jette
</pre>
<p>Submit a job, get its status, and cancel it. </p>
<pre>
adev0: sbatch my.sleeper
sbatch: Submitted batch job 473
adev0: squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
473 batch my.sleep jette R 00:00 1 adev9
adev0: scancel 473
adev0: squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
</pre>
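<p>The content of my.sleeper is not shown above; a hypothetical minimal
version might simply be:</p>
<pre>
adev0: cat my.sleeper
#!/bin/sh
# hypothetical script: hold the allocation for five minutes
sleep 300
</pre>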
<p>Get the SLURM partition and node status.</p>
<pre>
adev0: sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug up 00:30:00 8 idle adev[0-7]
batch up 12:00:00 1 down adev8
12:00:00 7 idle adev[9-15]
</pre>
<p class="footer"><a href="#top">top</a></p>
<h2><a name="mpi">MPI</a></h2>
<p>MPI use depends upon the type of MPI being used.
There are three fundamentally different modes of operation used
by these various MPI implementations.
<ol>
<li>SLURM directly launches the tasks and performs initialization
of communications (Quadrics MPI, MPICH2, MPICH-GM, MPICH-MX,
MVAPICH, MVAPICH2 and some MPICH1 modes).</li>
<li>SLURM creates a resource allocation for the job and then
mpirun launches tasks using SLURM's infrastructure (OpenMPI,
LAM/MPI and HP-MPI).</li>
<li>SLURM creates a resource allocation for the job and then
mpirun launches tasks using some mechanism other than SLURM,
such as SSH or RSH (BlueGene MPI and some MPICH1 modes).
These tasks are initiated outside of SLURM's monitoring
or control. SLURM's epilog should be configured to purge
these tasks when the job's allocation is relinquished
(a minimal epilog sketch appears after this list). </li>
</ol>
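<p>A minimal epilog sketch, assuming the <i>Epilog</i> parameter in
<b>slurm.conf</b> points at a script such as the one below and that the
epilog environment provides SLURM_UID (the UID threshold and logic are
illustrative; a production epilog should also verify that the user has no
other jobs still running on the node):</p>
<pre>
#!/bin/sh
# Hypothetical epilog script, referenced by Epilog= in slurm.conf.
# Kill any processes still owned by the job's user, skipping system accounts.
if [ "$SLURM_UID" -ge 1000 ]; then
    pkill -9 -U "$SLURM_UID"
fi
exit 0
</pre>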
<p>Instructions for using several varieties of MPI with SLURM are
provided below.</p>
<p> <a href="http://www.open-mpi.org/"><b>Open MPI</b></a> relies upon
SLURM to allocate resources for the job and then mpirun to initiate the
tasks. When using the <span class="commandline">salloc</span> command,
<span class="commandline">mpirun</span>'s -nolocal option is recommended.
For example:
<pre>
$ salloc -n4 sh # allocates 4 processors and spawns shell for job
&gt; mpirun -np 4 -nolocal a.out
&gt; exit # exits shell spawned by initial salloc command
</pre>
<p>Note that any direct use of <span class="commandline">srun</span>
will only launch one task per node when the LAM/MPI plugin is used.
To launch more than one task per node using the
<span class="commandline">srun</span> command, the <i>--mpi=none</i>
option will be required to explicitly disable the LAM/MPI plugin.</p>
<p> <a href="http://www.quadrics.com/"><b>Quadrics MPI</b></a> relies upon SLURM to
allocate resources for the job and <span class="commandline">srun</span>
to initiate the tasks. One would build the MPI program in the normal manner,
then initiate it using a command line of this sort:</p>
<pre>
$ srun [options] &lt;program&gt; [program args]
</pre>
<p> <a href="http://www.lam-mpi.org/"><b>LAM/MPI</b></a> relies upon the SLURM
<span class="commandline">salloc</span> or <span class="commandline">sbatch</span>
command to allocate resources. In either case, specify
the maximum number of tasks required for the job. Then execute the
<span class="commandline">lamboot</span> command to start lamd daemons.
<span class="commandline">lamboot</span> utilizes SLURM's
<span class="commandline">srun</span> command to launch these daemons.
Do not directly execute the <span class="commandline">srun</span> command
to launch LAM/MPI tasks. For example:
<pre>
$ salloc -n16 sh # allocates 16 processors and spawns shell for job
&gt; lamboot
&gt; mpirun -np 16 foo args
1234 foo running on adev0 (o)
2345 foo running on adev1
etc.
&gt; lamclean
&gt; lamhalt
&gt; exit # exits shell spawned by initial salloc command
</pre>
<p>Note that any direct use of <span class="commandline">srun</span>
will only launch one task per node when the LAM/MPI plugin is configured
as the default plugin. To launch more than one task per node using the
<span class="commandline">srun</span> command, the <i>--mpi=none</i>
option would be required to explicitly disable the LAM/MPI plugin
if that is the system default.</p>
<p class="footer"><a href="#top">top</a></p>
<p><a href="http://www.hp.com/go/mpi"><b>HP-MPI</b></a> uses the
<span class="commandline">mpirun</span> command with the <b>-srun</b>
option to launch jobs. For example:
<pre>
$MPI_ROOT/bin/mpirun -TCP -srun -N8 ./a.out
</pre></p>
<p><a href="http://www-unix.mcs.anl.gov/mpi/mpich2/"><b>MPICH2</b></a> jobs
are launched using the <b>srun</b> command. Just link your program with
SLURM's implementation of the PMI library so that tasks can communicate
host and port information at startup. (The system administrator can add
these options to the mpicc and mpif77 commands directly, so the user will not
need to bother). For example:
<pre>
$ mpicc -L&lt;path_to_slurm_lib&gt; -lpmi ...
$ srun -n20 a.out
</pre>
<b>NOTES:</b>
<ul>
<li>Some MPICH2 functions are not currently supported by the PMI
library integrated with SLURM</li>
<li>Set the environment variable <b>PMI_DEBUG</b> to a numeric value
of 1 or higher for the PMI library to print debugging information</li>
</ul></p>
<p><a href="http://www.myri.com/scs/download-mpichgm.html"><b>MPICH-GM</b></a>
jobs can be launched directly by the <b>srun</b> command.
SLURM's <i>mpichgm</i> MPI plugin must be used to establish communications
between the launched tasks. This can be accomplished either using the SLURM
configuration parameter <i>MpiDefault=mpichgm</i> in <b>slurm.conf</b>
or srun's <i>--mpi=mpichgm</i> option.
<pre>
$ mpicc ...
$ srun -n16 --mpi=mpichgm a.out
</pre>
<p><a href="http://www.myri.com/scs/download-mpichmx.html"><b>MPICH-MX</b></a>
jobs can be launched directly by the <b>srun</b> command.
SLURM's <i>mpichmx</i> MPI plugin must be used to establish communications
between the launched tasks. This can be accomplished either using the SLURM
configuration parameter <i>MpiDefault=mpichmx</i> in <b>slurm.conf</b>
or srun's <i>--mpi=mpichmx</i> option.
<pre>
$ mpicc ...
$ srun -n16 --mpi=mpichmx a.out
</pre>
<p><a href="http://nowlab.cse.ohio-state.edu/projects/mpi-iba"><b>MVAPICH</b></a>
jobs can be launched directly by the <b>srun</b> command.
SLURM's <i>mvapich</i> MPI plugin must be used to establish communications
between the launched tasks. This can be accomplished either using the SLURM
configuration parameter <i>MpiDefault=mvapich</i> in <b>slurm.conf</b>
or srun's <i>--mpi=mvapich</i> option.
<pre>
$ mpicc ...
$ srun -n16 --mpi=mvapich a.out
</pre>
<b>NOTE:</b> If MVAPICH is used in the shared memory model, with all tasks
running on a single node, then use the <i>mpich1_shmem</i> MPI plugin instead.<br>
<b>NOTE (for system administrators):</b> Configure
<i>PropagateResourceLimitsExcept=MEMLOCK</i> in <b>slurm.conf</b> and
start the <i>slurmd</i> daemons with an unlimited locked memory limit.
For more details, see
<a href="http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-420007.2.3">MVAPICH</a>
documentation for "CQ or QP Creation failure".</p>
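<p>For example, a sketch of the relevant configuration (how <i>slurmd</i> is
started varies by site; the snippet below assumes an init-script style
startup):</p>
<pre>
# slurm.conf
PropagateResourceLimitsExcept=MEMLOCK

# in the script that starts slurmd, before the daemon is launched
ulimit -l unlimited
</pre>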
<p><a href="http://nowlab.cse.ohio-state.edu/projects/mpi-iba"><b>MVAPICH2</b></a>
jobs can be launched directly by the <b>srun</b> command.
SLURM's <i>none</i> MPI plugin must be used to establish communications
between the launched tasks. This can be accomplished either using the SLURM
configuration parameter <i>MpiDefault=none</i> in <b>slurm.conf</b>
or srun's <i>--mpi=none</i> option. The program must also be linked with
SLURM's implementation of the PMI library so that tasks can communicate
host and port information at startup. (The system administrator can add
these options to the mpicc and mpif77 commands directly, so the user will not
need to bother). <b>Do not use SLURM's MVAPICH plugin for MVAPICH2.</b>
<pre>
$ mpicc -L&lt;path_to_slurm_lib&gt; -lpmi ...
$ srun -n16 --mpi=none a.out
</pre>
<p><a href="http://www.research.ibm.com/bluegene/"><b>BlueGene MPI</b></a> relies
upon SLURM to create the resource allocation and then uses the native
<span class="commandline">mpirun</span> command to launch tasks.
Build a job script containing one or more invocations of the
<span class="commandline">mpirun</span> command. Then submit
the script to SLURM using <span class="commandline">sbatch</span>.
For example:</p>
<pre>
$ sbatch -N512 my.script
</pre>
<p>Note that the node count specified with the <i>-N</i> option indicates
the base partition count.
See <a href="bluegene.html">BlueGene User and Administrator Guide</a>
for more information.</p>
<p><a href="http://www-unix.mcs.anl.gov/mpi/mpich1/"><b>MPICH1</b></a>
development ceased in 2005. It is recommended that you convert to
MPICH2 or some other MPI implementation.
If you still want to use MPICH1, note that it has several different
programming models. If you are using the shared memory model
(<i>DEFAULT_DEVICE=ch_shmem</i> in the mpirun script), then initiate
the tasks using the <span class="commandline">srun</span> command
with the <i>--mpi=mpich1_shmem</i> option.</p>
<pre>
$ srun -n16 --mpi=mpich1_shmem a.out
</pre>
<p>If you are using MPICH P4 (<i>DEFAULT_DEVICE=ch_p4</i> in
the mpirun script) and SLURM version 1.2.11 or newer,
then it is recommended that you apply the patch in the SLURM
distribution's file <i>contribs/mpich1.slurm.patch</i>.
Follow directions within the file to rebuild MPICH.
Applications must be relinked with the new library.
Initiate tasks using the
<span class="commandline">srun</span> command with the
<i>--mpi=mpich1_p4</i> option.</p>
<pre>
$ srun -n16 --mpi=mpich1_p4 a.out
</pre>
<p>Note that SLURM launches one task per node and the MPICH
library linked within your applications launches the other
tasks with shared memory used for communications between them.
The only real anomaly is that all output from all spawned tasks
on a node appears to SLURM as coming from the one task that it
launched. If the srun --label option is used, the task ID labels
will be misleading.</p>
<p>Other MPICH1 programming models currently rely upon the SLURM
<span class="commandline">salloc</span> or
<span class="commandline">sbatch</span> command to allocate resources.
In either case, specify the maximum number of tasks required for the job.
You may then need to build a list of hosts to be used and use that
as an argument to the mpirun command.
For example:
<pre>
$ cat mpich.sh
#!/bin/bash
srun hostname -s | sort -u >slurm.hosts
mpirun [options] -machinefile slurm.hosts a.out
rm -f slurm.hosts
$ sbatch -n16 mpich.sh
sbatch: Submitted batch job 1234
</pre>
<p>Note that in this example, mpirun uses the rsh command to launch
tasks. These tasks are not managed by SLURM since they are launched
outside of its control.</p>
<p style="text-align:center;">Last modified 14 August 2007</p>
<!--#include virtual="footer.txt"-->