| <!--#include virtual="header.txt"--> |
| |
| <h1>Quick Start User Guide</h1> |
| |
| <h2>Overview</h2> |
| <p>The Simple Linux Utility for Resource Management (SLURM) is an open source, |
| fault-tolerant, and highly scalable cluster management and job scheduling system |
| for large and small Linux clusters. SLURM requires no kernel modifications for |
| its operation and is relatively self-contained. As a cluster resource manager, |
| SLURM has three key functions. First, it allocates exclusive and/or non-exclusive |
| access to resources (compute nodes) to users for some duration of time so they |
| can perform work. Second, it provides a framework for starting, executing, and |
| monitoring work (normally a parallel job) on the set of allocated nodes. Finally, |
| it arbitrates conflicting requests for resources by managing a queue of pending |
| work.</p> |
| |
| <h2>Architecture</h2> |
| <p>As depicted in Figure 1, SLURM consists of a <b>slurmd</b> daemon running on |
| each compute node and a central <b>slurmctld</b> daemon running on a management node |
| (with optional fail-over twin). |
| The <b>slurmd</b> daemons provide fault-tolerant hierarchical communications. |
| The user commands include: <b>salloc</b>, <b>sattach</b>, <b>sbatch</b>, |
| <b>sbcast</b>, <b>scancel</b>, <b>sinfo</b>, <b>srun</b>, |
| <b>smap</b>, <b>squeue</b>, and <b>scontrol</b>. |
| All of the commands can run anywhere in the cluster.</p> |
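| |
| <p>For example, a quick way to confirm from any node that the daemons are up and |
| communicating is to query them directly (a minimal sketch; the host names used |
| throughout this guide are illustrative):</p> |
| <pre> |
| adev0: scontrol ping   # reports whether the primary (and backup) slurmctld responds |
| adev0: sinfo           # lists the partitions and nodes reported by the slurmd daemons |
| </pre> |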
| |
| <div class="figure"> |
| <img src="arch.gif" width="600"><br /> |
| Figure 1. SLURM components |
| </div> |
| |
| <p>The entities managed by these SLURM daemons, shown in Figure 2, include <b>nodes</b>, |
| the compute resource in SLURM; <b>partitions</b>, which group nodes into logical |
| sets; <b>jobs</b>, or allocations of resources assigned to a user for |
| a specified amount of time; and <b>job steps</b>, which are sets of (possibly |
| parallel) tasks within a job. |
| The partitions can be considered job queues, each of which has an assortment of |
| constraints such as job size limit, job time limit, users permitted to use it, etc. |
| Priority-ordered jobs are allocated nodes within a partition until the resources |
| (nodes, processors, memory, etc.) within that partition are exhausted. Once |
| a job is assigned a set of nodes, the user is able to initiate parallel work in |
| the form of job steps in any configuration within the allocation. For instance, |
| a single job step may be started that utilizes all nodes allocated to the job, |
| or several job steps may independently use a portion of the allocation.</p> |
| |
| <div class="figure"> |
| <img src="entities.gif" width="291" height="218"><br /> |
| Figure 2. SLURM entities |
| </div> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>Commands</h2> |
| <p>Man pages exist for all SLURM daemons, commands, and API functions. The command |
| option <span class="commandline">--help</span> also provides a brief summary of |
| options. Note that the command options are all case sensitive.</p> |
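| |
| <p>For example, either of the following displays the available options for the |
| <span class="commandline">squeue</span> command:</p> |
| <pre> |
| adev0: man squeue |
| adev0: squeue --help |
| </pre> |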
| |
| <p><span class="commandline"><b>salloc</b></span> is used to allocate resources |
| for a job in real time. Typically this is used to allocate resources and spawn a shell. |
| The shell is then used to execute srun commands to launch parallel tasks.</p> |
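| |
| <p>As a minimal sketch (the node count and program are illustrative), the following |
| allocates two nodes, spawns a shell on the local host, runs a job step within the |
| allocation, and releases the allocation when the shell exits:</p> |
| <pre> |
| adev0: salloc -N2 bash      # allocate 2 nodes and spawn a shell |
| adev0: srun /bin/hostname   # launch a job step on the allocated nodes |
| adev0: exit                 # exit the shell, releasing the allocation |
| </pre> |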
| |
| <p><span class="commandline"><b>sattach</b></span> is used to attach standard |
| input, output, and error plus signal capabilities to a currently running |
| job or job step. One can attach to and detach from jobs multiple times.</p> |
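| |
| <p>For example, to attach to job step 0 of job 15 (the job and step IDs here are |
| hypothetical):</p> |
| <pre> |
| adev0: sattach 15.0 |
| </pre> |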
| |
| <p><span class="commandline"><b>sbatch</b></span> is used to submit a job script |
| for later execution. The script will typically contain one or more srun commands |
| to launch parallel tasks.</p> |
| |
| <p><span class="commandline"><b>sbcast</b></span> is used to transfer a file |
| from local disk to local disk on the nodes allocated to a job. This can be |
| used to make effective use of diskless compute nodes or to provide improved |
| performance relative to a shared file system.</p> |
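| |
| <p>For example, a batch script might copy its executable to local disk on every |
| allocated node before launching it (the file names here are illustrative):</p> |
| <pre> |
| #!/bin/sh |
| #SBATCH -N4 |
| sbcast my_program /tmp/my_program   # copy to local disk on each allocated node |
| srun /tmp/my_program |
| </pre> |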
| |
| <p><span class="commandline"><b>scancel</b></span> is used to cancel a pending |
| or running job or job step. It can also be used to send an arbitrary signal to |
| all processes associated with a running job or job step.</p> |
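| |
| <p>For example (using a hypothetical job ID):</p> |
| <pre> |
| adev0: scancel 1234                 # cancel job 1234 |
| adev0: scancel --signal=USR1 1234   # send SIGUSR1 to all processes of job 1234 |
| </pre> |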
| |
| <p><span class="commandline"><b>scontrol</b></span> is the administrative tool |
| used to view and/or modify SLURM state. Note that many <span class="commandline">scontrol</span> |
| commands can only be executed as user root.</p> |
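| |
| <p>For example, to display a job's details and then raise its time limit |
| (the job ID is hypothetical; the update generally requires root privileges):</p> |
| <pre> |
| adev0: scontrol show job 1234 |
| adev0: scontrol update JobId=1234 TimeLimit=60 |
| </pre> |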
| |
| <p><span class="commandline"><b>sinfo</b></span> reports the state of partitions |
| and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting |
| options.</p> |
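| |
| <p>For example (the partition name is taken from the examples below):</p> |
| <pre> |
| adev0: sinfo -p debug      # report only the debug partition |
| adev0: sinfo -t idle -N    # list idle nodes, one line per node |
| </pre> |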
| |
| <p><span class="commandline"><b>squeue</b></span> reports the state of jobs or |
| job steps. It has a wide variety of filtering, sorting, and formatting options. |
| By default, it reports the running jobs in priority order and then the pending |
| jobs in priority order.</p> |
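| |
| <p>For example (the user name is taken from the examples below):</p> |
| <pre> |
| adev0: squeue -u jette     # report jobs belonging to user jette |
| adev0: squeue -t PD        # report pending jobs only |
| </pre> |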
| |
| <p><span class="commandline"><b>srun</b></span> is used to submit a job for |
| execution or initiate job steps in real time. |
| <span class="commandline">srun</span> |
| has a wide variety of options to specify resource requirements, including: minimum |
| and maximum node count, processor count, specific nodes to use or exclude, and |
| specific node characteristics (minimum memory, disk space, required features, etc.). |
| A job can contain multiple job steps executing sequentially or in parallel on |
| independent or shared nodes within the job's node allocation.</p> |
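| |
| <p>As a brief sketch of these options (the node names, sizes, and counts shown are |
| illustrative):</p> |
| <pre> |
| adev0: srun -N2-4 -n8 a.out                  # 8 tasks on between 2 and 4 nodes |
| adev0: srun -n8 -x adev8 --mem=1024 a.out    # 8 tasks, excluding node adev8, with at least 1024 MB of memory |
| </pre> |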
| |
| <p><span class="commandline"><b>smap</b></span> reports state information for |
| jobs, partitions, and nodes managed by SLURM, but graphically displays the |
| information to reflect network topology.</p> |
| |
| <p><span class="commandline"><b>sview</b></span> is a graphical user interface to |
| get and update state information for jobs, partitions, and nodes managed by SLURM.</p> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>Examples</h2> |
| <p>Execute <span class="commandline">/bin/hostname</span> on four nodes (<span class="commandline">-N4</span>). |
| Include task numbers on the output (<span class="commandline">-l</span>). The |
| default partition will be used. One task per node will be used by default. </p> |
| <pre> |
| adev0: srun -N4 -l /bin/hostname |
| 0: adev9 |
| 1: adev10 |
| 2: adev11 |
| 3: adev12 |
| </pre> <p>Execute <span class="commandline">/bin/hostname</span> in four |
| tasks (<span class="commandline">-n4</span>). Include task numbers on the output |
| (<span class="commandline">-l</span>). The default partition will be used. One |
| processor per task will be used by default (note that we don't specify a node |
| count).</p> |
| <pre> |
| adev0: srun -n4 -l /bin/hostname |
| 0: adev9 |
| 1: adev9 |
| 2: adev10 |
| 3: adev10 |
| </pre> <p>Submit the script my.script for later execution. |
| Explicitly use the nodes adev9 and adev10 (<span class="commandline">-w "adev[9-10]"</span>; note |
| the use of a node range expression). |
| We also explicitly state that the subsequent job steps will spawn four tasks |
| each, which will ensure that our allocation contains at least four processors |
| (one processor per task to be launched). |
| The output will appear in the file my.stdout (<span class="commandline">-o my.stdout</span>). |
| The script contains a time limit for the job embedded within it. |
| Other options can be supplied as desired by using a prefix of "#SBATCH" followed |
| by the option at the beginning of the script (before any commands to be executed |
| in the script). |
| Options supplied on the command line override any options specified within |
| the script. |
| Note that my.script contains the command <span class="commandline">/bin/hostname</span>, |
| which executes on the first node in the allocation (where the script runs), plus |
| two job steps initiated using the <span class="commandline">srun</span> command |
| and executed sequentially.</p> |
| <pre> |
| adev0: cat my.script |
| #!/bin/sh |
| #SBATCH --time=1 |
| /bin/hostname |
| srun -l /bin/hostname |
| srun -l /bin/pwd |
| |
| adev0: sbatch -n4 -w "adev[9-10]" -o my.stdout my.script |
| sbatch: Submitted batch job 469 |
| |
| adev0: cat my.stdout |
| adev9 |
| 0: adev9 |
| 1: adev9 |
| 2: adev10 |
| 3: adev10 |
| 0: /home/jette |
| 1: /home/jette |
| 2: /home/jette |
| 3: /home/jette |
| </pre> |
| |
| <p>Submit a job, get its status, and cancel it. </p> |
| <pre> |
| adev0: sbatch my.sleeper |
| sbatch: Submitted batch job 473 |
| |
| adev0: squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 473 batch my.sleep jette R 00:00 1 adev9 |
| |
| adev0: scancel 473 |
| |
| adev0: squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| </pre> |
| |
| <p>Get the SLURM partition and node status.</p> |
| <pre> |
| adev0: sinfo |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| debug up 00:30:00 8 idle adev[0-7] |
| batch up 12:00:00 1 down adev8 |
| 12:00:00 7 idle adev[9-15] |
| |
| </pre> |
| <p class="footer"><a href="#top">top</a></p> |
| |
| |
| |
| <h2><a name="mpi">MPI</a></h2> |
| <p>MPI use depends upon the type of MPI being used. |
| There are three fundamentally different modes of operation used |
| by these various MPI implementations. |
| <ol> |
| <li>SLURM directly launches the tasks and performs initialization |
| of communications (Quadrics MPI, MPICH2, MPICH-GM, MPICH-MX, |
| MVAPICH, MVAPICH2 and some MPICH1 modes).</li> |
| <li>SLURM creates a resource allocation for the job and then |
| mpirun launches tasks using SLURM's infrastructure (OpenMPI, |
| LAM/MPI and HP-MPI).</li> |
| <li>SLURM creates a resource allocation for the job and then |
| mpirun launches tasks using some mechanism other than SLURM, |
| such as SSH or RSH (BlueGene MPI and some MPICH1 modes). |
| These tasks are initiated outside of SLURM's monitoring |
| or control. SLURM's epilog should be configured to purge |
| these tasks when the job's allocation is relinquished.</li> |
| </ol> |
| <p>Instructions for using several varieties of MPI with SLURM are |
| provided below.</p> |
| |
| <p> <a href="http://www.open-mpi.org/"><b>Open MPI</b></a> relies upon |
| SLURM to allocate resources for the job and then mpirun to initiate the |
| tasks. When using the <span class="commandline">salloc</span> command, |
| <span class="commandline">mpirun</span>'s -nolocal option is recommended. |
| For example: |
| <pre> |
| $ salloc -n4 sh # allocates 4 processors and spawns shell for job |
| > mpirun -np 4 -nolocal a.out |
| > exit # exits shell spawned by initial salloc command |
| </pre> |
| <p>Note that any direct use of <span class="commandline">srun</span> |
| will only launch one task per node when the LAM/MPI plugin is used. |
| To launch more than one task per node using the |
| <span class="commandline">srun</span> command, the <i>--mpi=none</i> |
| option will be required to explicitly disable the LAM/MPI plugin.</p> |
| |
| <p> <a href="http://www.quadrics.com/"><b>Quadrics MPI</b></a> relies upon SLURM to |
| allocate resources for the job and <span class="commandline">srun</span> |
| to initiate the tasks. One would build the MPI program in the normal manner |
| then initiate it using a command line of this sort:</p> |
| <pre> |
| $ srun [options] <program> [program args] |
| </pre> |
| |
| <p> <a href="http://www.lam-mpi.org/"><b>LAM/MPI</b></a> relies upon the SLURM |
| <span class="commandline">salloc</span> or <span class="commandline">sbatch</span> |
| command to allocate resources. In either case, specify |
| the maximum number of tasks required for the job. Then execute the |
| <span class="commandline">lamboot</span> command to start lamd daemons. |
| <span class="commandline">lamboot</span> utilizes SLURM's |
| <span class="commandline">srun</span> command to launch these daemons. |
| Do not directly execute the <span class="commandline">srun</span> command |
| to launch LAM/MPI tasks. For example: |
| <pre> |
| $ salloc -n16 sh # allocates 16 processors and spawns shell for job |
| > lamboot |
| > mpirun -np 16 foo args |
| 1234 foo running on adev0 (o) |
| 2345 foo running on adev1 |
| etc. |
| > lamclean |
| > lamhalt |
| > exit # exits shell spawned by initial salloc command |
| </pre> |
| <p>Note that any direct use of <span class="commandline">srun</span> |
| will only launch one task per node when the LAM/MPI plugin is configured |
| as the default plugin. To launch more than one task per node using the |
| <span class="commandline">srun</span> command, the <i>--mpi=none</i> |
| option would be required to explicitly disable the LAM/MPI plugin |
| if that is the system default.</p> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p><a href="http://www.hp.com/go/mpi"><b>HP-MPI</b></a> uses the |
| <span class="commandline">mpirun</span> command with the <b>-srun</b> |
| option to launch jobs. For example: |
| <pre> |
| $MPI_ROOT/bin/mpirun -TCP -srun -N8 ./a.out |
| </pre></p> |
| |
| <p><a href="http://www-unix.mcs.anl.gov/mpi/mpich2/"><b>MPICH2</b></a> jobs |
| are launched using the <b>srun</b> command. Just link your program with |
| SLURM's implementation of the PMI library so that tasks can communicate |
| host and port information at startup. (The system administrator can add |
| these options to the mpicc and mpif77 commands directly, so the user will not |
| need to bother). For example: |
| <pre> |
| $ mpicc -L<path_to_slurm_lib> -lpmi ... |
| $ srun -n20 a.out |
| </pre> |
| <b>NOTES:</b> |
| <ul> |
| <li>Some MPICH2 functions are not currently supported by the PMI |
| library integrated with SLURM</li> |
| <li>Set the environment variable <b>PMI_DEBUG</b> to a numeric value |
| of 1 or higher for the PMI library to print debugging information</li> |
| </ul></p> |
| |
| <p><a href="http://www.myri.com/scs/download-mpichgm.html"><b>MPICH-GM</b></a> |
| jobs can be launched directly by the <b>srun</b> command. |
| SLURM's <i>mpichgm</i> MPI plugin must be used to establish communications |
| between the launched tasks. This can be accomplished either using the SLURM |
| configuration parameter <i>MpiDefault=mpichgm</i> in <b>slurm.conf</b> |
| or srun's <i>--mpi=mpichgm</i> option. |
| <pre> |
| $ mpicc ... |
| $ srun -n16 --mpi=mpichgm a.out |
| </pre> |
| |
| <p><a href="http://www.myri.com/scs/download-mpichmx.html"><b>MPICH-MX</b></a> |
| jobs can be launched directly by the <b>srun</b> command. |
| SLURM's <i>mpichmx</i> MPI plugin must be used to establish communications |
| between the launched tasks. This can be accomplished either using the SLURM |
| configuration parameter <i>MpiDefault=mpichmx</i> in <b>slurm.conf</b> |
| or srun's <i>--mpi=mpichmx</i> option. |
| <pre> |
| $ mpicc ... |
| $ srun -n16 --mpi=mpichmx a.out |
| </pre> |
| |
| <p><a href="http://nowlab.cse.ohio-state.edu/projects/mpi-iba"><b>MVAPICH</b></a> |
| jobs can be launched directly by the <b>srun</b> command. |
| SLURM's <i>mvapich</i> MPI plugin must be used to establish communications |
| between the launched tasks. This can be accomplished either using the SLURM |
| configuration parameter <i>MpiDefault=mvapich</i> in <b>slurm.conf</b> |
| or srun's <i>--mpi=mvapich</i> option. |
| <pre> |
| $ mpicc ... |
| $ srun -n16 --mpi=mvapich a.out |
| </pre> |
| <b>NOTE:</b> If MVAPICH is used in the shared memory model, with all tasks |
| running on a single node, then use the <i>mpich1_shmem</i> MPI plugin instead.<br> |
| <b>NOTE (for system administrators):</b> Configure |
| <i>PropagateResourceLimitsExcept=MEMLOCK</i> in <b>slurm.conf</b> and |
| start the <i>slurmd</i> daemons with an unlimited locked memory limit. |
| For more details, see |
| <a href="http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-420007.2.3">MVAPICH</a> |
| documentation for "CQ or QP Creation failure".</p> |
| |
| <p><a href="http://nowlab.cse.ohio-state.edu/projects/mpi-iba"><b>MVAPICH2</b></a> |
| jobs can be launched directly by the <b>srun</b> command. |
| SLURM's <i>none</i> MPI plugin must be used to establish communications |
| between the launched tasks. This can be accomplished either using the SLURM |
| configuration parameter <i>MpiDefault=none</i> in <b>slurm.conf</b> |
| or srun's <i>--mpi=none</i> option. The program must also be linked with |
| SLURM's implementation of the PMI library so that tasks can communicate |
| host and port information at startup. (The system administrator can add |
| these options to the mpicc and mpif77 commands directly, so the user will not |
| need to bother). <b>Do not use SLURM's MVAPICH plugin for MVAPICH2.</b> |
| <pre> |
| $ mpicc -L<path_to_slurm_lib> -lpmi ... |
| $ srun -n16 --mpi=none a.out |
| </pre> |
| |
| <p><a href="http://www.research.ibm.com/bluegene/"><b>BlueGene MPI</b></a> relies |
| upon SLURM to create the resource allocation and then uses the native |
| <span class="commandline">mpirun</span> command to launch tasks. |
| Build a job script containing one or more invocations of the |
| <span class="commandline">mpirun</span> command. Then submit |
| the script to SLURM using <span class="commandline">sbatch</span>. |
| For example:</p> |
| <pre> |
| $ sbatch -N512 my.script |
| </pre> |
| <p>Note that the node count specified with the <i>-N</i> option indicates |
| the base partition count. |
| See <a href="bluegene.html">BlueGene User and Administrator Guide</a> |
| for more information.</p> |
| |
| <p><a href="http://www-unix.mcs.anl.gov/mpi/mpich1/"><b>MPICH1</b></a> |
| development ceased in 2005. It is recommended that you convert to |
| MPICH2 or some other MPI implementation. |
| If you still want to use MPICH1, note that it has several different |
| programming models. If you are using the shared memory model |
| (<i>DEFAULT_DEVICE=ch_shmem</i> in the mpirun script), then initiate |
| the tasks using the <span class="commandline">srun</span> command |
| with the <i>--mpi=mpich1_shmem</i> option.</p> |
| <pre> |
| $ srun -n16 --mpi=mpich1_shmem a.out |
| </pre> |
| |
| <p>If you are using MPICH P4 (<i>DEFAULT_DEVICE=ch_p4</i> in |
| the mpirun script) and SLURM version 1.2.11 or newer, |
| then it is recommended that you apply the patch in the SLURM |
| distribution's file <i>contribs/mpich1.slurm.patch</i>. |
| Follow directions within the file to rebuild MPICH. |
| Applications must be relinked with the new library. |
| Initiate tasks using the |
| <span class="commandline">srun</span> command with the |
| <i>--mpi=mpich1_p4</i> option.</p> |
| <pre> |
| $ srun -n16 --mpi=mpich1_p4 a.out |
| </pre> |
| <p>Note that SLURM launches one task per node and the MPICH |
| library linked within your applications launches the other |
| tasks with shared memory used for communications between them. |
| The only real anomaly is that all output from all spawned tasks |
| on a node appears to SLURM as coming from the one task that it |
| launched. If the srun --label option is used, the task ID labels |
| will be misleading.</p> |
| |
| <p>Other MPICH1 programming models currently rely upon the SLURM |
| <span class="commandline">salloc</span> or |
| <span class="commandline">sbatch</span> command to allocate resources. |
| In either case, specify the maximum number of tasks required for the job. |
| You may then need to build a list of hosts to be used and use that |
| as an argument to the mpirun command. |
| For example: |
| <pre> |
| $ cat mpich.sh |
| #!/bin/bash |
| srun hostname -s | sort -u >slurm.hosts |
| mpirun [options] -machinefile slurm.hosts a.out |
| rm -f slurm.hosts |
| $ sbatch -n16 mpich.sh |
| sbatch: Submitted batch job 1234 |
| </pre> |
| <p>Note that in this example, mpirun uses the rsh command to launch |
| tasks. These tasks are not managed by SLURM since they are launched |
| outside of its control.</p> |
| |
| <p style="text-align:center;">Last modified 14 August 2007</p> |
| |
| <!--#include virtual="footer.txt"--> |