<!--#include virtual="header.txt"-->

<h1><a name="top">Overview</a></h1>

<p>The Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling system
for large and small Linux clusters. SLURM requires no kernel modifications for
its operation and is relatively self-contained. As a cluster resource manager,
SLURM has three key functions. First, it allocates exclusive and/or non-exclusive
access to resources (compute nodes) to users for some duration of time so they
can perform work. Second, it provides a framework for starting, executing, and
monitoring work (normally a parallel job) on the set of allocated nodes.
Finally, it arbitrates contention for resources by managing a queue of
pending work.
Optional plugins can be used for
<a href="accounting.html">accounting</a>,
<a href="reservations.html">advanced reservation</a>,
<a href="gang_scheduling.html">gang scheduling</a> (time sharing for
parallel jobs), backfill scheduling,
<a href="resource_limits.html">resource limits</a> by user or bank account,
and sophisticated <a href="priority_multifactor.html">multifactor job
prioritization</a> algorithms.</p>

<p>SLURM has been developed through the collaborative efforts of
<a href="https://www.llnl.gov/">Lawrence Livermore National Laboratory (LLNL)</a>,
<a href="http://www.hp.com/">Hewlett-Packard</a>,
<a href="http://www.bull.com/">Bull</a>,
Linux NetworX and many other contributors.</p>

<h2>Architecture</h2>
<p>SLURM has a centralized manager, <b>slurmctld</b>, to monitor resources and
work. There may also be a backup manager to assume those responsibilities in the
event of failure. Each compute server (node) has a <b>slurmd</b> daemon, which
can be compared to a remote shell: it waits for work, executes that work, returns
status, and waits for more work.
The <b>slurmd</b> daemons provide fault-tolerant hierarchical communications.
There is an optional <b>slurmdbd</b> (Slurm DataBase Daemon) which can be used
to record accounting information for multiple Slurm-managed clusters in a
single database.
User tools include <b>srun</b> to initiate jobs,
<b>scancel</b> to terminate queued or running jobs,
<b>sinfo</b> to report system status,
<b>squeue</b> to report the status of jobs, and
<b>sacct</b> to get information about jobs and job steps that are running or have completed.
The <b>smap</b> and <b>sview</b> commands graphically report system and
job status, including network topology.
The administrative tool <b>scontrol</b> can be used to monitor
and/or modify configuration and state information on the cluster.
The administrative tool used to manage the database is <b>sacctmgr</b>.
It can be used to identify the clusters, valid users, valid bank accounts, etc.
APIs are available for all functions.</p>
| |
| <div class="figure"> |
| <img src="arch.gif" width="550"><br> |
| Figure 1. SLURM components |
| </div> |
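
<p>As a brief illustration of how these user tools fit together, the following
is a minimal sketch of a command sequence; the job ID, node, cluster, account
and user names shown are only examples.</p>
<pre>
# Report the state of nodes and partitions
sinfo

# Run a 4-task job on 2 nodes and print each task's host name
srun -N2 -n4 hostname

# List queued and running jobs, then cancel job 1234
squeue
scancel 1234

# Report accounting information for job 1234 and its job steps
sacct -j 1234

# Examine and modify node state (administrator)
scontrol show node lx0003
scontrol update NodeName=lx0003 State=DRAIN Reason="bad disk"

# Define a cluster, bank account and user in the accounting database
sacctmgr add cluster snowflake
sacctmgr add account physics cluster=snowflake
sacctmgr add user alice account=physics
</pre>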

<p>SLURM has a general-purpose plugin mechanism available to easily support various
infrastructures. This permits a wide variety of SLURM configurations using a
building block approach. These plugins presently include (a sample plugin
configuration follows the list):
<ul>
<li><a href="accounting_storageplugins.html">Accounting Storage</a>:
text file (default if jobacct_gather != none),
MySQL, PGSQL, SlurmDBD (Slurm Database Daemon) or none</li>

<li><a href="authplugins.html">Authentication of communications</a>:
<a href="http://www.theether.org/authd/">authd</a>,
<a href="http://home.gna.org/munge/">munge</a>, or none (default).</li>

<li><a href="checkpoint_plugins.html">Checkpoint</a>: AIX, OpenMPI, XLCH, or none.</li>

<li><a href="crypto_plugins.html">Cryptography (Digital Signature Generation)</a>:
<a href="http://home.gna.org/munge/">munge</a> (default) or
<a href="http://www.openssl.org/">OpenSSL</a>.</li>

<li><a href="jobacct_gatherplugins.html">Job Accounting Gather</a>: AIX, Linux, or none (default).</li>

<li><a href="jobcompplugins.html">Job Completion Logging</a>:
text file, arbitrary script, MySQL, PGSQL, SlurmDBD, or none (default).</li>

<li><a href="mpiplugins.html">MPI</a>: LAM, MPICH1-P4, MPICH1-shmem,
MPICH-GM, MPICH-MX, MVAPICH, OpenMPI and none (default, for most
other versions of MPI including MPICH2 and MVAPICH2).</li>

<li><a href="priority_plugins.html">Priority</a>:
Assigns priorities to jobs as they arrive.
Options include
<a href="priority_multifactor.html">multifactor job prioritization</a>
(assigns job priority based upon fair-share allocation, size, age, QoS, and/or partition) and
basic (assigns job priority based upon age for First In First Out ordering, default).</li>

<li><a href="proctrack_plugins.html">Process tracking (for signaling)</a>:
AIX (using a kernel extension), Linux process tree hierarchy, process group ID,
RMS (Quadrics Linux kernel patch),
and <a href="http://oss.sgi.com/projects/pagg/">SGI's Process Aggregates (PAGG)</a>.</li>

| <li><a href="selectplugins.html">Node selection</a>: |
| Bluegene (a 3-D torus interconnect BGL or BGP), |
| <a href="cons_res.html">consumable resources</a> (to allocate |
| individual processors and memory) or linear (to dedicate entire nodes).</li> |

<li><a href="schedplugins.html">Scheduler</a>:
builtin (First In First Out, default),
backfill (starts jobs early if doing so does not delay the expected initiation
time of any higher priority job),
gang (time-slicing for parallel jobs),
<a href="http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php">
The Maui Scheduler</a>, and
<a href="http://www.clusterresources.com/pages/products/moab-cluster-suite.php">
Moab Cluster Suite</a>.
There is also a <a href="priority_multifactor.html">multifactor job
prioritization</a> plugin
available for use with the builtin, backfill and gang schedulers only.
Jobs can be prioritized by age, size, fair-share allocation, etc.
Many <a href="resource_limits.html">resource limits</a> are also
configurable by user or bank account.</li>

| <li><a href="switchplugins.html">Switch or interconnect</a>: |
| <a href="http://www.quadrics.com/">Quadrics</a> |
| (Elan3 or Elan4), |
| Federation |
| <a href="http://publib-b.boulder.ibm.com/Redbooks.nsf/f338d71ccde39f08852568dd006f956d/55258945787efc2e85256db00051980a?OpenDocument">Federation</a> (IBM High Performance Switch), |
| or none (actually means nothing requiring special handling, such as Ethernet or |
| <a href="http://www.myricom.com/">Myrinet</a>, default).</li> |

<li><a href="taskplugins.html">Task Affinity</a>:
Affinity (bind tasks to processors or CPU sets) or none (no binding, the default).</li>

| <li><a href="topology_plugin.html">Network Topology</a>: |
| 3d_torus (optimize resource selection based upon a 3d_torus interconnect, default for Cray XT, Sun Constellation and IBM BlueGene), |
| tree (optimize resource selection based upon switch connections) or |
| none (the default).</li> |
| |
| </ul> |
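
<p>Which plugins a cluster uses is controlled by parameters in the slurm.conf
file. The fragment below is only a sketch; the values shown (e.g.
select/cons_res, sched/backfill) are typical choices and would be adjusted to
match the plugins actually built and desired on a given system.</p>
<pre>
# Plugin selection in slurm.conf (illustrative values)
AuthType=auth/munge                        # authenticate communications with munge
CryptoType=crypto/munge                    # digital signature generation
SelectType=select/cons_res                 # allocate individual processors and memory
SchedulerType=sched/backfill               # backfill scheduling
PriorityType=priority/multifactor          # multifactor job prioritization
ProctrackType=proctrack/pgid               # track processes by process group ID
TaskPlugin=task/affinity                   # bind tasks to processors
TopologyPlugin=topology/tree               # optimize selection using switch topology
JobAcctGatherType=jobacct_gather/linux     # gather job accounting data on Linux
AccountingStorageType=accounting_storage/slurmdbd  # store accounting via slurmdbd
JobCompType=jobcomp/filetxt                # log completed jobs to a text file
MpiDefault=none                            # default MPI plugin
SwitchType=switch/none                     # no special interconnect handling
</pre>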

<p>The entities managed by these SLURM daemons, shown in Figure 2, include
<b>nodes</b>, the compute resource in SLURM;
<b>partitions</b>, which group nodes into logical sets;
<b>jobs</b>, or allocations of resources assigned to a user for
a specified amount of time; and
<b>job steps</b>, which are sets of (possibly parallel) tasks within a job.
The partitions can be considered job queues, each of which has an assortment of
constraints such as job size limit, job time limit, users permitted to use it, etc.
Priority-ordered jobs are allocated nodes within a partition until the resources
(nodes, processors, memory, etc.) within that partition are exhausted. Once
a job is assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For instance,
a single job step may be started that utilizes all nodes allocated to the job,
or several job steps may independently use a portion of the allocation.
SLURM provides resource management for the processors allocated to a job,
so that multiple job steps can be simultaneously submitted and queued until
there are available resources within the job's allocation.</p>

<div class="figure">
<img src="entities.gif" width="550"><br>
Figure 2. SLURM entities
</div>
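
<p>For example, a user might request an allocation and then launch job steps
within it as follows. This is only a sketch; the node counts, time limit and
program name are illustrative, and it assumes an interactive allocation
obtained with <b>salloc</b>.</p>
<pre>
# Obtain an allocation of 4 nodes for up to 30 minutes
salloc -N4 -t30

# One job step may use all nodes in the allocation ...
srun -N4 -n16 ./a.out

# ... or several steps may each use a portion of it, in parallel
srun -N2 -n8 ./a.out &
srun -N2 -n8 ./a.out &
wait
</pre>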

<p class="footer"><a href="#top">top</a></p>

<h2>Configurability</h2>
<p>The node state monitored includes: count of processors, size of real memory, size
of temporary disk space, and state (UP, DOWN, etc.). Additional node information
includes weight (preference in being allocated work) and features (arbitrary information
such as processor speed or type).
Nodes are grouped into partitions, which may contain overlapping nodes, so they are
best thought of as job queues.
Partition information includes: name, list of associated nodes, state (UP or DOWN),
maximum job time limit, maximum node count per job, group access list,
priority (important if nodes are in multiple partitions) and shared node access policy
with optional over-subscription level for gang scheduling (e.g. YES, NO or FORCE:2).
Bit maps are used to represent nodes, and scheduling
decisions can be made by performing a small number of comparisons and a series
of fast bit map manipulations. A sample (partial) SLURM configuration file follows.</p>
<pre>
#
# Sample /etc/slurm.conf
#
ControlMachine=linux0001
BackupController=linux0002
#
AuthType=auth/munge
Epilog=/usr/local/slurm/sbin/epilog
PluginDir=/usr/local/slurm/lib
Prolog=/usr/local/slurm/sbin/prolog
SlurmctldPort=7002
SlurmctldTimeout=120
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=120
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT Procs=4 TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] RealMemory=2048 Weight=2
NodeName=lx[8001-9999] RealMemory=4096 Weight=6 Feature=video
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=DEFAULT MaxTime=UNLIMITED MaxNodes=4096
PartitionName=batch Nodes=lx[0041-9999]
</pre>
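
<p>Once the daemons are started with such a configuration, the resulting node
and partition definitions can be inspected with the standard tools; for
example (output omitted), one might run:</p>
<pre>
scontrol show config           # display the running configuration
scontrol show partition debug  # display the "debug" partition definition
sinfo -p batch                 # report node states in the "batch" partition
</pre>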

<p style="text-align:center;">Last modified 31 March 2009</p>

<!--#include virtual="footer.txt"-->