<!--#include virtual="header.txt"-->
<h1><a name="top">Overview</a></h1>
<p>The Simple Linux Utility for Resource Management (SLURM) is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling system
for large and small Linux clusters. SLURM requires no kernel modifications for
its operation and is relatively self-contained. As a cluster resource manager,
SLURM has three key functions. First, it allocates exclusive and/or non-exclusive
access to resources (compute nodes) to users for some duration of time so they
can perform work. Second, it provides a framework for starting, executing, and
monitoring work (normally a parallel job) on the set of allocated nodes.
Finally, it arbitrates contention for resources by managing a queue of
pending work.
Optional plugins can be used for
<a href="accounting.html">accounting</a>,
<a href="reservations.html">advanced reservation</a>,
<a href="gang_scheduling.html">gang scheduling</a> (time sharing for
parallel jobs), backfill scheduling,
<a href="resource_limits.html">resource limits</a> by user or bank account,
and sophisticated <a href="priority_multifactor.html"> multifactor job
prioritization</a> algorithms.
<p>SLURM has been developed through the collaborative efforts of
<a href="https://www.llnl.gov/">Lawrence Livermore National Laboratory (LLNL)</a>,
<a href="http://www.hp.com/">Hewlett-Packard</a>,
<a href="http://www.bull.com/">Bull</a>,
Linux NetworX and many other contributors.</p>
<h2>Architecture</h2>
<p>SLURM has a centralized manager, <b>slurmctld</b>, to monitor resources and
work. There may also be a backup manager to assume those responsibilities in the
event of failure. Each compute server (node) has a <b>slurmd</b> daemon, which
can be compared to a remote shell: it waits for work, executes that work, returns
status, and waits for more work.
The <b>slurmd</b> daemons provide fault-tolerant hierarchical communications.
There is an optional <b>slurmdbd</b> (Slurm DataBase Daemon) which can be used
to record accounting information for multiple Slurm-managed clusters in a
single database.
User tools include <b>srun</b> to initiate jobs,
<b>scancel</b> to terminate queued or running jobs,
<b>sinfo</b> to report system status,
<b>squeue</b> to report the status of jobs, and
<b>sacct</b> to get information about jobs and job steps that are running or have completed.
The <b>smap</b> and <b>sview</b> commands graphically report system and
job status, including network topology.
The administrative tool <b>scontrol</b> can be used to monitor
and/or modify configuration and state information on the cluster.
The administrative tool used to manage the database is <b>sacctmgr</b>.
It can be used to identify the clusters, valid users, valid bank accounts, etc.
APIs are available for all functions.</p>
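<p>For illustration, a typical interactive session using these commands might
look like the following (the node and partition names, job id, and reported
output are hypothetical and will vary from site to site):</p>
<pre>
# Report the state of partitions and nodes
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up      30:00     28   idle lx[0003-0030]

# Launch a four-task job across two nodes in the default partition
$ srun -N2 -n4 hostname

# Report pending and running jobs, then cancel one by its job id
$ squeue
$ scancel 1234
</pre>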
<div class="figure">
<img src="arch.gif" width="550"><br>
Figure 1. SLURM components
</div>
<p>SLURM has a general-purpose plugin mechanism available to easily support various
infrastructures. This permits a wide variety of SLURM configurations using a
building block approach. These plugins presently include:
<ul>
<li><a href="accounting_storageplugins.html">Accounting Storage</a>:
text file (default if jobacct_gather != none),
MySQL, PGSQL, SlurmDBD (Slurm Database Daemon) or none</li>
<li><a href="authplugins.html">Authentication of communications</a>:
<a href="http://www.theether.org/authd/">authd</a>,
<a href="http://home.gna.org/munge/">munge</a>, or none (default).</li>
<li><a href="checkpoint_plugins.html">Checkpoint</a>: AIX, OpenMPI, XLCH, or none.</li>
<li><a href="crypto_plugins.html">Cryptography (Digital Signature Generation)</a>:
<a href="http://home.gna.org/munge/">munge</a> (default) or
<a href="http://www.openssl.org/">OpenSSL</a>.</li>
<li><a href="jobacct_gatherplugins.html">Job Accounting Gather</a>: AIX, Linux, or none(default)</li>
<li><a href="jobcompplugins.html">Job Completion Logging</a>:
text file, arbitrary script, MySQL, PGSQL, SlurmDBD, or none (default).</li>
<li><a href="mpiplugins.html">MPI</a>: LAM, MPICH1-P4, MPICH1-shmem,
MPICH-GM, MPICH-MX, MVAPICH, OpenMPI and none (default, for most
other versions of MPI including MPICH2 and MVAPICH2).</li>
<li><a href="priority_plugins.html">Priority</a>:
Assigns priorities to jobs as they arrive.
Options include
<a href="priority_multifactor.html">multifactor job prioritization</a>
(assigns job priority based upon fair-share allocation, size, age, QoS, and/or partition) and
basic (assigns job priority based upon age for First In First Out ordering, default).</li>
<li><a href="proctrack_plugins.html">Process tracking (for signaling)</a>:
AIX (using a kernel extension), Linux process tree hierarchy, process group ID,
RMS (Quadrics Linux kernel patch),
and <a href="http://oss.sgi.com/projects/pagg/">SGI's Process Aggregates (PAGG)</a>.</li>
<li><a href="selectplugins.html">Node selection</a>:
BlueGene (for the 3-D torus interconnect of IBM BlueGene/L or BlueGene/P systems),
<a href="cons_res.html">consumable resources</a> (to allocate
individual processors and memory) or linear (to dedicate entire nodes).</li>
<li><a href="schedplugins.html">Scheduler</a>:
builtin (First In First Out, default),
backfill (starts jobs early if doing so does not delay the expected initiation
time of any higher priority job),
gang (time-slicing for parallel jobs),
<a href="http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php">
The Maui Scheduler</a>, and
<a href="http://www.clusterresources.com/pages/products/moab-cluster-suite.php">
Moab Cluster Suite</a>.
There is also a <a href="priority_multifactor.html">multifactor job
prioritization</a> plugin
available for use with the basic, backfill and gang schedulers only.
Jobs can be prioritized by age, size, fair-share allocation, etc.
Many <a href="resource_limits.html">resource limits</a> are also
configurable by user or bank account.</li>
<li><a href="switchplugins.html">Switch or interconnect</a>:
<a href="http://www.quadrics.com/">Quadrics</a>
(Elan3 or Elan4),
<a href="http://publib-b.boulder.ibm.com/Redbooks.nsf/f338d71ccde39f08852568dd006f956d/55258945787efc2e85256db00051980a?OpenDocument">Federation</a> (IBM High Performance Switch),
or none (actually means nothing requiring special handling, such as Ethernet or
<a href="http://www.myricom.com/">Myrinet</a>, default).</li>
<li><a href="taskplugins.html">Task Affinity</a>:
Affinity (bind tasks to processors or CPU sets) or none (no binding, the default).</li>
<li><a href="topology_plugin.html">Network Topology</a>:
3d_torus (optimize resource selection based upon a 3d_torus interconnect, default for Cray XT, Sun Constellation and IBM BlueGene),
tree (optimize resource selection based upon switch connections) or
none (the default).</li>
</ul>
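<p>Plugins are selected through parameters in the <b>slurm.conf</b> file.
The fragment below is only a sketch of how several of the plugin types listed
above might be chosen; the parameter names are standard, but the appropriate
values depend upon your site and SLURM version, so consult the
<b>slurm.conf</b> man page before copying any of them.</p>
<pre>
# Example plugin selections (illustrative values only)
AuthType=auth/munge                      # authentication of communications
CryptoType=crypto/munge                  # digital signature generation
SchedulerType=sched/backfill             # backfill scheduling
SelectType=select/cons_res               # consumable resource node selection
PriorityType=priority/multifactor        # multifactor job prioritization
ProctrackType=proctrack/pgid             # process tracking by process group ID
TaskPlugin=task/affinity                 # bind tasks to processors
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/none                 # no job completion logging
TopologyPlugin=topology/tree             # switch-aware resource selection
MpiDefault=none                          # default MPI plugin type
</pre>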
<p>The entities managed by these SLURM daemons, shown in Figure 2, include <b>nodes</b>,
the compute resource in SLURM, <b>partitions</b>, which group nodes into logical
sets, <b>jobs</b>, or allocations of resources assigned to a user for
a specified amount of time, and <b>job steps</b>, which are sets of (possibly
parallel) tasks within a job.
The partitions can be considered job queues, each of which has an assortment of
constraints such as job size limit, job time limit, users permitted to use it, etc.
Priority-ordered jobs are allocated nodes within a partition until the resources
(nodes, processors, memory, etc.) within that partition are exhausted. Once
a job is assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For instance,
a single job step may be started that utilizes all nodes allocated to the job,
or several job steps may independently use a portion of the allocation.
SLURM provides resource management for the processors allocated to a job,
so that multiple job steps can be simultaneously submitted and queued until
there are available resources within the job's allocation.</p>
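<p>As a sketch of this model, the hypothetical batch script below (submitted
with the <b>sbatch</b> command) requests four nodes and then launches job
steps within that allocation, one using every allocated node and two running
concurrently on separate halves of it; the program names are placeholders:</p>
<pre>
#!/bin/sh
#SBATCH -N4                   # request an allocation of four nodes
srun -N4 -n16 ./prepare       # job step spanning all allocated nodes
srun -N2 -n8 ./analyze_a &amp;    # two job steps sharing the allocation,
srun -N2 -n8 ./analyze_b &amp;    # each confined to two of the four nodes
wait                          # wait for the concurrent steps to complete
</pre>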
<div class="figure">
<img src="entities.gif" width="550"><br>
Figure 2. SLURM entities
</div>
<p class="footer"><a href="#top">top</a></p>
<h2>Configurability</h2>
<p>The node state monitored includes: count of processors, size of real memory, size
of temporary disk space, and state (UP, DOWN, etc.). Additional node information
includes weight (preference in being allocated work) and features (arbitrary information
such as processor speed or type).
Nodes are grouped into partitions, which may contain overlapping sets of nodes,
so they are best thought of as job queues.
Partition information includes: name, list of associated nodes, state (UP or DOWN),
maximum job time limit, maximum node count per job, group access list,
priority (important if nodes are in multiple partitions) and shared node access policy
with optional over-subscription level for gang scheduling (e.g. YES, NO or FORCE:2).
Bit maps are used to represent nodes, so scheduling
decisions can be made by performing a small number of comparisons and a series
of fast bit map manipulations. A sample (partial) SLURM configuration file follows.</p>
<pre>
#
# Sample /etc/slurm.conf
#
ControlMachine=linux0001
BackupController=linux0002
#
AuthType=auth/munge
Epilog=/usr/local/slurm/sbin/epilog
PluginDir=/usr/local/slurm/lib
Prolog=/usr/local/slurm/sbin/prolog
SlurmctldPort=7002
SlurmctldTimeout=120
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=120
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT Procs=4 TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] RealMemory=2048 Weight=2
NodeName=lx[8001-9999] RealMemory=4096 Weight=6 Feature=video
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=DEFAULT MaxTime=UNLIMITED MaxNodes=4096
PartitionName=batch Nodes=lx[0041-9999]
</pre>
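<p>Given a configuration like the one above, the resulting node and partition
state can be inspected, and for administrators modified, with <b>scontrol</b>.
The commands below are illustrative:</p>
<pre>
# Show state derived from the partition and node configurations above
$ scontrol show partition debug
$ scontrol show node lx0003

# Drain a node for maintenance (administrators only)
$ scontrol update NodeName=lx0003 State=DRAIN Reason="disk replacement"
</pre>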
<p style="text-align:center;">Last modified 31 March 2009</p>
<!--#include virtual="footer.txt"-->