| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">Overview</a></h1> |
| |
| <p>The Simple Linux Utility for Resource Management (SLURM) is an open source, |
| fault-tolerant, and highly scalable cluster management and job scheduling system |
| for large and small Linux clusters. SLURM requires no kernel modifications for |
| its operation and is relatively self-contained. As a cluster workload manager, |
| SLURM has three key functions. First, it allocates exclusive and/or non-exclusive |
| access to resources (compute nodes) to users for some duration of time so they |
| can perform work. Second, it provides a framework for starting, executing, and |
| monitoring work (normally a parallel job) on the set of allocated nodes. |
| Finally, it arbitrates contention for resources by managing a queue of |
| pending work. |
| Optional plugins can be used for |
| <a href="accounting.html">accounting</a>, |
| <a href="reservations.html">advanced reservation</a>, |
| <a href="gang_scheduling.html">gang scheduling</a> (time sharing for |
| parallel jobs), backfill scheduling, |
| <a href="topology.html">topology optimized resource selection</a>, |
| <a href="resource_limits.html">resource limits</a> by user or bank account, |
and sophisticated <a href="priority_multifactor.html">multifactor job
prioritization</a> algorithms.</p>
| |
| <h2>Architecture</h2> |
| <p>SLURM has a centralized manager, <b>slurmctld</b>, to monitor resources and |
| work. There may also be a backup manager to assume those responsibilities in the |
| event of failure. Each compute server (node) has a <b>slurmd</b> daemon, which |
| can be compared to a remote shell: it waits for work, executes that work, returns |
| status, and waits for more work. |
| The <b>slurmd</b> daemons provide fault-tolerant hierarchical communications. |
| There is an optional <b>slurmdbd</b> (Slurm DataBase Daemon) which can be used |
| to record accounting information for multiple Slurm-managed clusters in a |
| single database. |
| User tools include <b>srun</b> to initiate jobs, |
| <b>scancel</b> to terminate queued or running jobs, |
| <b>sinfo</b> to report system status, |
| <b>squeue</b> to report the status of jobs, and |
| <b>sacct</b> to get information about jobs and job steps that are running or have completed. |
The <b>smap</b> and <b>sview</b> commands graphically report system and
job status, including network topology.
An administrative tool, <b>scontrol</b>, is available to monitor
and/or modify configuration and state information on the cluster.
| The administrative tool used to manage the database is <b>sacctmgr</b>. |
| It can be used to identify the clusters, valid users, valid bank accounts, etc. |
| APIs are available for all functions.</p> |
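<p>A minimal command-line sketch of these user tools follows; the node counts
and the job ID shown are purely illustrative:</p>
<pre>
# Report the state of partitions and nodes
sinfo

# Run hostname as three tasks across two allocated nodes,
# labeling each line of output with its task number
srun -N2 -n3 -l hostname

# Report the status of queued and running jobs, then cancel job 1234
squeue
scancel 1234

# Summarize jobs and job steps recorded by the accounting mechanism
sacct
</pre>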
| |
| <div class="figure"> |
| <img src="arch.gif" width="550"><br> |
| Figure 1. SLURM components |
| </div> |
| |
<p>SLURM has a general-purpose plugin mechanism that makes it easy to support various
infrastructures. This permits a wide variety of SLURM configurations using a
building block approach. The plugins presently available include the following
(a sample selection of plugin settings in slurm.conf follows the list):</p>
| <ul> |
| <li><a href="accounting_storageplugins.html">Accounting Storage</a>: |
| Primarily Used to store historical data about jobs. When used with |
| SlurmDBD (Slurm Database Daemon), it can also supply a |
| limits based system along with historical system status. |
| </li> |
| |
| <li><a href="acct_gather_energy_plugins.html">Account Gather Energy</a>: |
| Gather energy comsumption data per job or nodes in the system. |
| This plugin is integrated with the |
| <a href="accounting_storageplugins.html">Accounting Storage</a> and |
| <a href="jobacct_gatherplugins.html"> Job Account Gather</a> plugins. |
| </li> |
| |
| <li><a href="authplugins.html">Authentication of communications</a>: |
Provides an authentication mechanism between the various components of Slurm.
| </li> |
| |
| <li><a href="checkpoint_plugins.html">Checkpoint</a>: |
| Interface to various checkpoint mechanisms. |
| </li> |
| |
| <li><a href="crypto_plugins.html">Cryptography (Digital Signature |
| Generation)</a>: |
Mechanism used to generate a digital signature, which is used to validate
that a job step is authorized to execute on specific nodes.
This is distinct from the plugin used for
<a href="authplugins.html">Authentication</a> since the job step
request is sent from the user's srun command rather than directly from the
slurmctld daemon, which generates the job step credential and its
digital signature.
| </li> |
| |
| <li><a href="gres.html">Generic Resources</a>: Provide interface to |
| control generic reources like Processing Units (GPUs) and Intel® |
| Many Integrated Core (MIC) processors. |
| </li> |
| |
| <li><a href="job_submit_plugins.html">Job Submit</a>: |
Custom plugin to allow site-specific control over job requirements at
| submission and update. |
| </li> |
| |
| <li><a href="jobacct_gatherplugins.html">Job Accounting Gather</a>: |
Gathers job step resource utilization data.
| </li> |
| |
| <li><a href="jobcompplugins.html">Job Completion Logging</a>: |
| Log a job's termination data. This is typically a subset of data stored by |
| an <a href="accounting_storageplugins.html">Accounting Storage Plugin</a>. |
| </li> |
| |
| <li><a href="launch_plugins.html">Launchers</a>: |
| Controls the mechanism used by the <a href="srun.html">'srun'</a> command |
| to launch the tasks. |
| </li> |
| |
| <li><a href="mpiplugins.html">MPI</a>: |
| Provides different hooks for the various MPI implementations. |
| For example, this can set MPI specific environment variables. |
| </li> |
| |
| <li><a href="preempt.html">Preempt</a>: |
| Determines which jobs can preempt other jobs and the preemption mechanism |
| to be used. |
| </li> |
| |
| <li><a href="priority_plugins.html">Priority</a>: |
| Assigns priorities to jobs upon submission and on an ongoing basis |
| (e.g. as they age). |
| </li> |
| |
| <li><a href="proctrack_plugins.html">Process tracking (for signaling)</a>: |
| Provides a mechanism for identifying the processes associated with each job. |
Used for job accounting and signaling.
| </li> |
| |
| <li><a href="schedplugins.html">Scheduler</a>: |
Determines how and when Slurm schedules jobs.
| </li> |
| |
| <li><a href="selectplugins.html">Node selection</a>: |
Determines the resources to be used for a job allocation.
| </li> |
| |
| <li><a href="switchplugins.html">Switch or interconnect</a>: |
| Plugin to interface with a switch or interconnect. |
For most systems (Ethernet or InfiniBand) this is not needed.
| </li> |
| |
| <li><a href="taskplugins.html">Task Affinity</a>: |
Provides a mechanism to bind a job and its individual tasks to specific
processors.
| </li> |
| |
| <li><a href="topology_plugin.html">Network Topology</a>: |
| Optimizes resource selection based upon the network topology. |
Used for both job allocations and advanced reservations.
| </li> |
| |
| </ul> |
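<p>Which plugin of each type is loaded is controlled through parameters in the
slurm.conf file. The fragment below is only an illustrative sketch of such a
selection; the values shown are examples of available plugins, not
recommendations for any particular site:</p>
<pre>
# Illustrative plugin selections in slurm.conf
AuthType=auth/munge
SchedulerType=sched/backfill
SelectType=select/linear
PriorityType=priority/multifactor
ProctrackType=proctrack/pgid
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/filetxt
MpiDefault=none
SwitchType=switch/none
TopologyPlugin=topology/tree
</pre>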
| |
| <p>The entities managed by these SLURM daemons, shown in Figure 2, include <b>nodes</b>, |
| the compute resource in SLURM, <b>partitions</b>, which group nodes into logical |
| sets, <b>jobs</b>, or allocations of resources assigned to a user for |
| a specified amount of time, and <b>job steps</b>, which are sets of (possibly |
| parallel) tasks within a job. |
| The partitions can be considered job queues, each of which has an assortment of |
| constraints such as job size limit, job time limit, users permitted to use it, etc. |
| Priority-ordered jobs are allocated nodes within a partition until the resources |
| (nodes, processors, memory, etc.) within that partition are exhausted. Once |
| a job is assigned a set of nodes, the user is able to initiate parallel work in |
| the form of job steps in any configuration within the allocation. For instance, |
| a single job step may be started that utilizes all nodes allocated to the job, |
| or several job steps may independently use a portion of the allocation. |
| SLURM provides resource management for the processors allocated to a job, |
| so that multiple job steps can be simultaneously submitted and queued until |
| there are available resources within the job's allocation.</p> |
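<p>As a sketch of how job steps fit within an allocation, the hypothetical
batch script below (submitted with <b>sbatch</b>; the program name my_app is a
placeholder) requests four nodes and runs two job steps concurrently, each on
a disjoint half of the allocation:</p>
<pre>
#!/bin/sh
#SBATCH -N4        # request an allocation of four nodes
#SBATCH -t 30      # 30 minute time limit

# Two job steps, each spanning two of the four allocated nodes.
# Started in the background, they execute concurrently within the allocation.
srun -N2 -n2 ./my_app &
srun -N2 -n2 ./my_app &
wait               # wait for both job steps to complete
</pre>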
| |
| <div class="figure"> |
| <img src="entities.gif" width="550"><br> |
| Figure 2. SLURM entities |
| </div> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>Configurability</h2> |
<p>The node state monitored includes: count of processors, size of real memory, size
| of temporary disk space, and state (UP, DOWN, etc.). Additional node information |
| includes weight (preference in being allocated work) and features (arbitrary information |
| such as processor speed or type). |
| Nodes are grouped into partitions, which may contain overlapping nodes so they are |
| best thought of as job queues. |
| Partition information includes: name, list of associated nodes, state (UP or DOWN), |
| maximum job time limit, maximum node count per job, group access list, |
| priority (important if nodes are in multiple partitions) and shared node access policy |
| with optional over-subscription level for gang scheduling (e.g. YES, NO or FORCE:2). |
Bit maps are used to represent nodes, and scheduling
| decisions can be made by performing a small number of comparisons and a series |
| of fast bit map manipulations. A sample (partial) SLURM configuration file follows.</p> |
| <pre> |
| # |
| # Sample /etc/slurm.conf |
| # |
| ControlMachine=linux0001 |
| BackupController=linux0002 |
| # |
| AuthType=auth/munge |
| Epilog=/usr/local/slurm/sbin/epilog |
| PluginDir=/usr/local/slurm/lib |
| Prolog=/usr/local/slurm/sbin/prolog |
| SlurmctldPort=7002 |
| SlurmctldTimeout=120 |
| SlurmdPort=7003 |
| SlurmdSpoolDir=/var/tmp/slurmd.spool |
| SlurmdTimeout=120 |
| StateSaveLocation=/usr/local/slurm/slurm.state |
| TmpFS=/tmp |
| # |
| # Node Configurations |
| # |
| NodeName=DEFAULT CPUs=4 TmpDisk=16384 State=IDLE |
| NodeName=lx[0001-0002] State=DRAINED |
| NodeName=lx[0003-8000] RealMemory=2048 Weight=2 |
| NodeName=lx[8001-9999] RealMemory=4096 Weight=6 Feature=video |
| # |
| # Partition Configurations |
| # |
| PartitionName=DEFAULT MaxTime=30 MaxNodes=2 |
| PartitionName=login Nodes=lx[0001-0002] State=DOWN |
| PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES |
| PartitionName=class Nodes=lx[0031-0040] AllowGroups=students |
| PartitionName=DEFAULT MaxTime=UNLIMITED MaxNodes=4096 |
| PartitionName=batch Nodes=lx[0041-9999] |
| </pre> |
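<p>Given a configuration like the sample above, the resulting node and
partition definitions can be inspected with <b>scontrol</b> and <b>sinfo</b>
(the node and partition names below refer to the sample configuration):</p>
<pre>
# Show the configuration and state of one node and one partition
scontrol show node lx0003
scontrol show partition debug

# Summarize node states on a per-partition basis
sinfo
</pre>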
| |
| <p style="text-align:center;">Last modified 6 March 2013</p> |
| |
| <!--#include virtual="footer.txt"--> |