<!--#include virtual="header.txt"-->
<h1><a name="top">SLURM Elastic Computing</a></h1>
<h2>Overview</h2>
<p>SLURM version 2.4 has the ability to support a cluster that grows and
shrinks on demand, typically relying upon a service such as
<a href="http://aws.amazon.com/ec2/">Amazon Elastic Computing Cloud (Amazon EC2)</a>
for resources.
These resources can be combined with an existing cluster to process excess
workload (cloud bursting) or they can operate as an independent self-contained
cluster.
Good responsiveness and throughput can be achieved while you pay only for the
resources needed.</p>
<p>The
<a href="http://web.mit.edu/star/cluster/docs/latest/index.html">StarCluster</a>
cloud computing toolkit has a
<a href="https://github.com/jlafon/StarCluster">SLURM port available</a>.
<a href="https://github.com/jlafon/StarCluster/wiki/Getting-started-with-SLURM-on-Amazon's-EC2">
Instructions</a> for the SLURM port of StartCLuster are available online.</p>
<p>The rest of this document describes details about SLURM's infrastructure that
can be used to support Elastic Computing.</p>
<p>SLURM's Elastic Computing logic relies heavily upon the existing power save
logic.
Review of SLURM's <a href="power_save.html">Power Saving Guide</a> is strongly
recommended.
This logic executes one program when nodes are required for use and another
program when those nodes are no longer required.
For Elastic Computing, these programs will need to provision the resources
from the cloud and notify SLURM of the node's name and network address and
later relinquish the nodes back to the cloud.
Most of the SLURM changes to support Elastic Computing were changes to
support node addressing that can change.</p>
<h2>SLURM Configuration</h2>
<p>There are many ways to configure SLURM's use of resources.
See the slurm.conf man page for more details about these options.
Some general SLURM configuration parameters that are of interest include:
<dl>
<dt><b>ResumeProgram</b>
<dd>The program executed when a node has been allocated and should be made
available for use.
<dt><b>SelectType</b>
<dd>Generally must be "select/linear".
If SLURM is configured to allocate individual CPUs to jobs rather than whole
nodes (e.g. SelectType=select/cons_res rather than SelectType=select/linear),
then SLURM maintains bitmaps to track the state of every CPU in the system.
If the number of CPUs to be allocated on each node is not known when the
slurmctld daemon is started, one must allocate whole nodes to jobs rather
than individual processors.
The use of "select/cons_res" requires each node to have a CPU count set and
the node eventually selected must have at least that number of CPUs.
<dt><b>SuspendExcNodes</b>
<dd>Nodes not subject to suspend/resume logic. This may be used to avoid
suspending and resuming nodes which are not in the cloud. Alternately the
suspend/resume programs can treat local nodes differently from nodes being
provisioned from the cloud.
<dt><b>SuspendProgram</b>
<dd>The program executed when a node is no longer required and can be
relinquished to the cloud.
<dt><b>SuspendTime</b>
<dd>The time interval that a node will be left idle before a request is made to
relinquish it. Units are seconds.
<dt><b>TreeWidth</b>
<dd>Since the slurmd daemons are not aware of the network addresses of other
nodes in the cloud, messages must be sent directly to each slurmd daemon
rather than being forwarded between the slurmd daemons. To do so, configure
TreeWidth to a number at least as large as the maximum node count.
The value may not exceed 65533.
</dl>
</p>
<p>Some node parameters that are of interest include:
<dl>
<dt><b>Feature</b>
<dd>A node feature can be associated with resources acquired from the cloud and
user jobs can specify their preference for resource use with the "--constraint"
option.
<dt><b>NodeName</b>
<dd>This is the name by which SLURM refers to the node. A name containing a
numeric suffix is recommended for convenience. The NodeAddr and NodeHostname
should not be set, but will be configured later using scripts.
<dt><b>State</b>
<dd>Nodes which are to be added on demand should have a state of "CLOUD".
These nodes will not actually appear in SLURM commands until after they are
configured for use.
<dt><b>Weight</b>
<dd>Each node can be configured with a weight indicating the desirability of
using that resource. Nodes with lower weights are used before those with higher
weights.
</dl>
</p>
<p>Nodes to be acquired on demand can be placed into their own SLURM partition.
In this mode of operation, those nodes are used only if explicitly requested by
the user. Note that jobs can be submitted to multiple partitions and will use
resources from whichever partition permits faster initiation.
A sample configuration is shown below in which nodes are added from the cloud
when the workload exceeds available resources. Users can explicitly request
local resources or resources from the cloud by using the "--constraint" option.
</p>
<pre>
# SLURM configuration
# Excerpt of slurm.conf
SelectType=select/linear
SuspendProgram=/usr/sbin/slurm_suspend
ResumeProgram=/usr/sbin/slurm_resume
SuspendTime=600
SuspendExcNodes=tux[0-127]
TreeWidth=128
NodeName=tux[0-127] Weight=1 Feature=local State=UNKNOWN
NodeName=ec[0-127] Weight=8 Feature=cloud State=CLOUD
PartitionName=debug MaxTime=1:00:00 Nodes=tux[0-32] Default=yes
PartitionName=batch MaxTime=8:00:00 Nodes=tux[0-127],ec[0-127] Default=no
</pre>
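<p>With the configuration above, a user can restrict a job to local or cloud
resources through the "--constraint" option, since those features are assigned
to the corresponding node sets. The job script name below is, of course, just a
placeholder:</p>
<pre>
# Run only on local nodes
sbatch --constraint=local my.script

# Run only on nodes provisioned from the cloud
sbatch --constraint=cloud my.script
</pre>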
<h2>Operational Details</h2>
<p>When the slurmctld daemon starts, all nodes with a state of CLOUD will be
included in its internal tables, but these node records will not be seen with
user commands or used by applications until allocated to some job. Once
allocated, the <i>ResumeProgram</i> is executed and should do the following
(a sketch of such a script appears after this list):</p>
<ol>
<li>Boot the node</li>
<li>Configure and start Munge (depends upon configuration)</li>
<li>Install the SLURM configuration file, slurm.conf, on the node.
Note that the configuration file will generally be identical on all nodes and
not include NodeAddr or NodeHostname configuration parameters for any nodes in
the cloud.
SLURM commands executed on this node only need to communicate with the
slurmctld daemon on the ControlMachine.</li>
<li>Notify the slurmctld daemon of the node's hostname and network address:<br>
<i>scontrol update nodename=ec0 nodeaddr=123.45.67.89 nodehostname=whatever</i><br>
Note that the node address and hostname information set by the scontrol command
are preserved when the slurmctld daemon is restarted unless the "-c"
(cold-start) option is used.</li>
<li>Start the slurmd daemon on the node</li>
</ol>
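<p>Below is a minimal sketch of a ResumeProgram written as a shell script.
SLURM invokes the program with the names of the nodes to be made available (in
hostlist format) as its argument. The provisioning step is site-specific:
"start_instance" is a hypothetical helper standing in for the cloud provider's
API call (e.g. the EC2 command line tools) that boots an instance, installs
slurm.conf, starts Munge and slurmd, and prints the instance's IP address.</p>
<pre>
#!/bin/bash
# Sketch of a ResumeProgram: $1 is the hostlist of nodes to provision,
# e.g. "ec[0-3]". Expand it to individual node names with scontrol.
for node in $(scontrol show hostnames "$1"); do
    # Boot an instance for this node and capture its address
    # ("start_instance" is a hypothetical, site-specific helper).
    addr=$(start_instance "$node")

    # Tell slurmctld how to reach the new node.
    scontrol update nodename="$node" nodeaddr="$addr" nodehostname="$node"
done
</pre>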
<p>The <i>SuspendProgram</i> only needs to relinquish the node back to the
cloud.</p>
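<p>A corresponding SuspendProgram can be equally simple. In the sketch below,
"stop_instance" is again a hypothetical helper standing in for the cloud
provider's API call to terminate or stop the instance:</p>
<pre>
#!/bin/bash
# Sketch of a SuspendProgram: $1 is the hostlist of nodes to relinquish.
for node in $(scontrol show hostnames "$1"); do
    # Release the instance back to the cloud (site-specific;
    # "stop_instance" is a hypothetical helper).
    stop_instance "$node"
done
</pre>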
<p>An environment variable SLURM_NODE_ALIASES contains sets of node name,
communication address and hostname.
The variable is set by salloc, sbatch, and srun.
It is then used by srun to determine the destination for job launch
communication messages.
This environment variable is only set for nodes allocated from the cloud.
If a job is allocated some resources from the local cluster and others from
the cloud, only those nodes from the cloud will appear in SLURM_NODE_ALIASES.
Each set of names and addresses is comma separated and
the elements within the set are separated by colons. For example:<br>
SLURM_NODE_ALIASES=ec0:123.45.67.8:foo,ec2:123.45.67.9:bar</p>
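<p>This format is easy to parse in a shell script. A minimal sketch, assuming
the variable is set as in the example above:</p>
<pre>
#!/bin/bash
# Parse SLURM_NODE_ALIASES into its components.
# Sets are comma separated; fields within a set are colon separated.
IFS=',' read -ra sets <<< "$SLURM_NODE_ALIASES"
for set in "${sets[@]}"; do
    IFS=':' read -r name addr host <<< "$set"
    echo "node=$name address=$addr hostname=$host"
done
</pre>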
<h2>Remaining Work</h2>
<ul>
<li>We need scripts to provision resources from EC2.</li>
<li>The SLURM_NODE_ALIASES environment variable needs to change if a job
expands (adds resources).</li>
<li>Some MPI implementations will not work due to the node naming.</li>
<li>Some tests in SLURM's test suite fail.</li>
</ul>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 15 May 2012</p>
<!--#include virtual="footer.txt"-->