<!--#include virtual="header.txt"-->
<h1><a name="top">SLURM Elastic Computing</a></h1>
<h2>Overview</h2>
<p>SLURM version 2.4 has the ability to support a cluster that grows and
shrinks on demand, typically relying upon a service such as
<a href="http://aws.amazon.com/ec2/">Amazon Elastic Computing Cloud (Amazon EC2)</a>
for resources.
These resources can be combined with an existing cluster to process excess
workload (cloud bursting) or they can operate as an independent self-contained
cluster.
Good responsiveness and throughput can be achieved while you pay only for the
resources needed.</p>
<p>The
<a href="http://web.mit.edu/star/cluster/docs/latest/index.html">StarCluster</a>
cloud computing toolkit has a
<a href="https://github.com/jlafon/StarCluster">SLURM port available</a>.
<a href="https://github.com/jlafon/StarCluster/wiki/Getting-started-with-SLURM-on-Amazon's-EC2">
Instructions</a> for the SLURM port of StartCLuster are available online.</p>
<p>The rest of this document describes details about SLURM's infrastructure that
can be used to support Elastic Computing.</p>
<p>SLURM's Elastic Computing logic relies heavily upon the existing power save
logic.
Review of SLURM's <a href="power_save.html">Power Saving Guide</a> is strongly
recommended.
This logic executes one program when nodes are required for use and another
program when those nodes are no longer required.
For Elastic Computing, these programs will need to provision the resources
from the cloud and notify SLURM of the node's name and network address and
later relinquish the nodes back to the cloud.
Most of the SLURM changes to support Elastic Computing were changes to
support node addressing that can change.</p>
<h2>SLURM Configuration</h2>
<p>There are many ways to configure SLURM's use of resources.
See the slurm.conf man page for more details about these options.
Some general SLURM configuration parameters that are of interest include:
<dl>
<dt><b>ResumeProgram</b>
<dd>The program executed when a node has been allocated and should be made
available for use.
<dt><b>SelectType</b>
<dd>Generally must be "select/linear".
If SLURM is configured to allocate individual CPUs to jobs rather than whole
nodes (e.g. SelectType=select/cons_res rather than SelectType=select/linear),
then SLURM maintains bitmaps to track the state of every CPU in the system.
If the number of CPUs to be allocated on each node is not known when the
slurmctld daemon is started, one must allocate whole nodes to jobs rather
than individual processors.
The use of "select/cons_res" requires each node to have a CPU count set and
the node eventually selected must have at least that number of CPUs.
<dt><b>SuspendExcNodes</b>
<dd>Nodes not subject to suspend/resume logic. This may be used to avoid
suspending and resuming nodes which are not in the cloud. Alternately the
suspend/resume programs can treat local nodes differently from nodes being
provisioned from the cloud.
<dt><b>SuspendProgram</b>
<dd>The program executed when a node is no longer required and can be
relinquished to the cloud.
<dt><b>SuspendTime</b>
<dd>The time interval that a node will be left idle before a request is made to
relinquish it. Units are seconds.
<dt><b>TreeWidth</b>
<dd>Since the slurmd daemons are not aware of the network addresses of other
nodes in the cloud, messages must be sent directly to each slurmd daemon
rather than being forwarded between the slurmd daemons. To do so, configure
TreeWidth to a number at least as large as the maximum node count.
The value may not exceed 65533.
</dl>
</p>
<p>Some node parameters that are of interest include:
<dl>
<dt><b>Feature</b>
<dd>A node feature can be associated with resources acquired from the cloud and
user jobs can specify their preference for resource use with the "--constraint"
option.
<dt><b>NodeName</b>
<dd>This is the name by which SLURM refers to the node. A name containing a
numeric suffix is recommended for convenience. The NodeAddr and NodeHostname
should not be set, but will be configured later using scripts.
<dt><b>State</b>
<dd>Nodes which are to be added on demand should have a state of "CLOUD".
These nodes will not actually appear in SLURM commands until after they are
configured for use.
<dt><b>Weight</b>
<dd>Each node can be configured with a weight indicating the desirability of
using that resource. Nodes with lower weights are used before those with higher
weights.
</dl>
</p>
<p>Nodes to be acquired on demand can be placed into their own SLURM partition.
In this mode of operation, those nodes are used only if explicitly requested by
the user. Note that jobs can be submitted to multiple partitions and will use
resources from whichever partition permits faster initiation.
A sample configuration is shown below in which nodes are added from the cloud
when the workload exceeds available resources. Users can explicitly request
local resources or resources from the cloud by using the "--constraint" option.
</p>
<pre>
# SLURM configuration
# Excerpt of slurm.conf
SelectType=select/linear
SuspendProgram=/usr/sbin/slurm_suspend
ResumeProgram=/usr/sbin/slurm_resume
SuspendTime=600
SuspendExcNodes=tux[0-127]
TreeWidth=128
NodeName=tux[0-127] Weight=1 Feature=local State=UNKNOWN
NodeName=ec[0-127] Weight=8 Feature=cloud State=CLOUD
PartitionName=debug MaxTime=1:00:00 Nodes=tux[0-32] Default=yes
PartitionName=batch MaxTime=8:00:00 Nodes=tux[0-127],ec[0-127] Default=no
</pre>
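<p>With the configuration above, a user can restrict a job to local or cloud
resources through the "--constraint" option, since those features are assigned
to the corresponding node sets. The job script name below is, of course, just a
placeholder:</p>
<pre>
# Run only on local nodes
sbatch --constraint=local my.script

# Run only on nodes provisioned from the cloud
sbatch --constraint=cloud my.script
</pre>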
<h2>Operational Details</h2>
<p>When the slurmctld daemon starts, all nodes with a state of CLOUD will be
included in its internal tables, but these node records will not be seen with
user commands or used by applications until allocated to some job. Once
allocated, the <i>ResumeProgram</i> is executed and should do the following
(a sketch of such a script appears after this list):</p>
<ol>
<li>Boot the node</li>
<li>Configure and start Munge (depends upon configuration)</li>
<li>Install the SLURM configuration file, slurm.conf, on the node.
Note that the configuration file will generally be identical on all nodes and
not include NodeAddr or NodeHostname configuration parameters for any nodes in
the cloud.
SLURM commands executed on this node only need to communicate with the
slurmctld daemon on the ControlMachine.</li>
<li>Notify the slurmctld daemon of the node's hostname and network address:<br>
<i>scontrol update nodename=ec0 nodeaddr=123.45.67.89 nodehostname=whatever</i><br>
Note that the node address and hostname information set by the scontrol command
are preserved when the slurmctld daemon is restarted unless the "-c"
(cold-start) option is used.</li>
<li>Start the slurmd daemon on the node</li>
</ol>
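<p>Below is a minimal sketch of a ResumeProgram written as a shell script.
SLURM invokes the program with the names of the nodes to be made available (in
hostlist format) as its argument. The provisioning step is site-specific:
"start_instance" is a hypothetical helper standing in for the cloud provider's
API call (e.g. the EC2 command line tools) that boots an instance, installs
slurm.conf, starts Munge and slurmd, and prints the instance's IP address.</p>
<pre>
#!/bin/bash
# Sketch of a ResumeProgram: $1 is the hostlist of nodes to provision,
# e.g. "ec[0-3]". Expand it to individual node names with scontrol.
for node in $(scontrol show hostnames "$1"); do
    # Boot an instance for this node and capture its address
    # ("start_instance" is a hypothetical, site-specific helper).
    addr=$(start_instance "$node")

    # Tell slurmctld how to reach the new node.
    scontrol update nodename="$node" nodeaddr="$addr" nodehostname="$node"
done
</pre>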
<p>The <i>SuspendProgram</i> only needs to relinquish the node back to the
cloud.</p>
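<p>A corresponding SuspendProgram can be equally simple. In the sketch below,
"stop_instance" is again a hypothetical helper standing in for the cloud
provider's API call to terminate or stop the instance:</p>
<pre>
#!/bin/bash
# Sketch of a SuspendProgram: $1 is the hostlist of nodes to relinquish.
for node in $(scontrol show hostnames "$1"); do
    # Release the instance back to the cloud (site-specific;
    # "stop_instance" is a hypothetical helper).
    stop_instance "$node"
done
</pre>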
<p>An environment variable SLURM_NODE_ALIASES contains sets of node name,
communication address and hostname.
The variable is set by salloc, sbatch, and srun.
It is then used by srun to determine the destination for job launch
communication messages.
This environment variable is only set for nodes allocated from the cloud.
If a job is allocated some resources from the local cluster and others from
the cloud, only those nodes from the cloud will appear in SLURM_NODE_ALIASES.
Each set of names and addresses is comma separated and
the elements within the set are separated by colons. For example:<br>
SLURM_NODE_ALIASES=ec0:123.45.67.8:foo,ec2:123.45.67.9:bar</p>
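<p>This format is easy to parse in a shell script. A minimal sketch, assuming
the variable is set as in the example above:</p>
<pre>
#!/bin/bash
# Parse SLURM_NODE_ALIASES into its components.
# Sets are comma separated; fields within a set are colon separated.
IFS=',' read -ra sets <<< "$SLURM_NODE_ALIASES"
for set in "${sets[@]}"; do
    IFS=':' read -r name addr host <<< "$set"
    echo "node=$name address=$addr hostname=$host"
done
</pre>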
<h2>Remaining Work</h2>
<ul>
<li>We need scripts to provision resources from EC2.</li>
<li>The SLURM_NODE_ALIASES environment variable needs to change if a job
expands (adds resources).</li>
<li>Some MPI implementations will not work due to the node naming.</li>
<li>Some tests in SLURM's test suite fail.</li>
</ul>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 15 May 2012</p>
<!--#include virtual="footer.txt"-->