| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">SLURM Elastic Computing</a></h1> |
| |
| <h2>Overview</h2> |
| |
<p>SLURM version 2.4 can support a cluster that grows and
shrinks on demand, typically relying upon a service such as
<a href="http://aws.amazon.com/ec2/">Amazon Elastic Compute Cloud (Amazon EC2)</a>
for resources.
These resources can be combined with an existing cluster to process excess
workload (cloud bursting) or they can operate as an independent self-contained
cluster.
Good responsiveness and throughput can be achieved while paying only for the
resources used.</p>
| |
| <p>The |
| <a href="http://web.mit.edu/star/cluster/docs/latest/index.html">StarCluster</a> |
| cloud computing toolkit has a |
| <a href="https://github.com/jlafon/StarCluster">SLURM port available</a>. |
| <a href="https://github.com/jlafon/StarCluster/wiki/Getting-started-with-SLURM-on-Amazon's-EC2"> |
Instructions</a> for the SLURM port of StarCluster are available online.</p>
| |
| <p>The rest of this document describes details about SLURM's infrastructure that |
| can be used to support Elastic Computing.</p> |
| |
| <p>SLURM's Elastic Computing logic relies heavily upon the existing power save |
| logic. |
| Review of SLURM's <a href="power_save.html">Power Saving Guide</a> is strongly |
| recommended. |
This logic executes one program when nodes are required for use and another
program when those nodes are no longer required.
For Elastic Computing, these programs will need to provision resources
from the cloud, notify SLURM of each node's name and network address, and
later relinquish the nodes back to the cloud.
Most of the SLURM changes made to support Elastic Computing deal with node
network addresses that can change over time.</p>
| |
| <h2>SLURM Configuration</h2> |
| |
| <p>There are many ways to configure SLURM's use of resources. |
| See the slurm.conf man page for more details about these options. |
| Some general SLURM configuration parameters that are of interest include: |
| <dl> |
| <dt><b>ResumeProgram</b> |
| <dd>The program executed when a node has been allocated and should be made |
| available for use. |
| <dt><b>SelectType</b> |
| <dd>Generally must be "select/linear". |
| If SLURM is configured to allocate individual CPUs to jobs rather than whole |
| nodes (e.g. SelectType=select/cons_res rather than SelectType=select/linear), |
| then SLURM maintains bitmaps to track the state of every CPU in the system. |
| If the number of CPUs to be allocated on each node is not known when the |
| slurmctld daemon is started, one must allocate whole nodes to jobs rather |
| than individual processors. |
| The use of "select/cons_res" requires each node to have a CPU count set and |
| the node eventually selected must have at least that number of CPUs. |
| <dt><b>SuspendExcNodes</b> |
| <dd>Nodes not subject to suspend/resume logic. This may be used to avoid |
suspending and resuming nodes which are not in the cloud. Alternately, the
suspend/resume programs can treat local nodes differently from nodes being
provisioned from the cloud.
| <dt><b>SuspendProgram</b> |
| <dd>The program executed when a node is no longer required and can be |
| relinquished to the cloud. |
| <dt><b>SuspendTime</b> |
| <dd>The time interval that a node will be left idle before a request is made to |
| relinquish it. Units are seconds. |
| <dt><b>TreeWidth</b> |
<dd>Since the slurmd daemons are not aware of the network addresses of other
nodes in the cloud, messages must be sent to each slurmd daemon directly
rather than being forwarded between the daemons. To do so,
configure TreeWidth to a number at least as large as the maximum node count.
The value may not exceed 65533.
| </dl> |
| </p> |
| |
| <p>Some node parameters that are of interest include: |
| <dl> |
| <dt><b>Feature</b> |
| <dd>A node feature can be associated with resources acquired from the cloud and |
| user jobs can specify their preference for resource use with the "--constraint" |
| option. |
| <dt><b>NodeName</b> |
| <dd>This is the name by which SLURM refers to the node. A name containing a |
| numeric suffix is recommended for convenience. The NodeAddr and NodeHostname |
| should not be set, but will be configured later using scripts. |
| <dt><b>State</b> |
<dd>Nodes which are to be added on demand should have a state of "CLOUD".
| These nodes will not actually appear in SLURM commands until after they are |
| configured for use. |
| <dt><b>Weight</b> |
| <dd>Each node can be configured with a weight indicating the desirability of |
| using that resource. Nodes with lower weights are used before those with higher |
| weights. |
| </dl> |
| </p> |
| |
<p>Nodes to be acquired on demand can be placed into their own SLURM partition.
This mode of operation allows these nodes to be used only if explicitly
requested by the user. Note that jobs can be submitted to multiple partitions
and will use resources from whichever partition permits faster initiation.
A sample configuration is shown below in which nodes are added from the cloud
when the workload exceeds available resources. Users can explicitly request
local resources or resources from the cloud by using the "--constraint"
option; example job submissions follow the configuration excerpt.
</p>
| |
| <pre> |
| # SLURM configuration |
| # Excerpt of slurm.conf |
| SelectType=select/linear |
| |
| SuspendProgram=/usr/sbin/slurm_suspend |
ResumeProgram=/usr/sbin/slurm_resume
| SuspendTime=600 |
| SuspendExcNodes=tux[0-127] |
| TreeWidth=128 |
| |
| NodeName=tux[0-127] Weight=1 Feature=local State=UNKNOWN |
| NodeName=ec[0-127] Weight=8 Feature=cloud State=CLOUD |
| PartitionName=debug MaxTime=1:00:00 Nodes=tux[0-32] Default=yes |
| PartitionName=batch MaxTime=8:00:00 Nodes=tux[0-127],ec[0-127] Default=no |
| </pre> |
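
<p>With the sample configuration above, jobs might be submitted as shown
below. These commands are only illustrative; "my.bash" is a placeholder for
a job script.</p>

<pre>
# Run only on local resources
sbatch -p batch --constraint=local -N4 my.bash

# Run only on resources provisioned from the cloud
sbatch -p batch --constraint=cloud -N4 my.bash

# Submit to both partitions and run in whichever starts the job sooner
sbatch -p debug,batch -N4 my.bash
</pre>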
| |
| <h2>Operational Details</h2> |
| |
<p>When the slurmctld daemon starts, all nodes with a state of CLOUD will be
included in its internal tables, but these node records will not be seen with
user commands or used by applications until allocated to some job. After
allocation, the <i>ResumeProgram</i> is executed and should do the following
(a sample script sketch appears after this list):</p>
| <ol> |
| <li>Boot the node</li> |
| <li>Configure and start Munge (depends upon configuration)</li> |
<li>Install the SLURM configuration file, slurm.conf, on the node.
Note that the configuration file will generally be identical on all nodes and
not include NodeAddr or NodeHostname configuration parameters for any nodes in
the cloud.
SLURM commands executed on this node only need to communicate with the
slurmctld daemon on the ControlMachine.</li>
| <li>Notify the slurmctld daemon of the node's hostname and network address:<br> |
| <i>scontrol update nodename=ec0 nodeaddr=123.45.67.89 nodehostname=whatever</i><br> |
Note that the node address and hostname information set by the scontrol command
are preserved when the slurmctld daemon is restarted unless the "-c"
(cold-start) option is used.</li>
| <li>Start the slurmd daemon on the node</li> |
| </ol> |
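
<p>A minimal <i>ResumeProgram</i> sketch is shown below. The "start_instance"
helper is a hypothetical site-specific command, assumed to boot an instance
from an image with Munge, slurm.conf and slurmd pre-installed and to print
the instance's hostname and network address; it is not part of SLURM.</p>

<pre>
#!/bin/bash
# Sample ResumeProgram sketch. SLURM invokes this program with the names
# of the nodes to be made available (e.g. "ec[0-3]") as its argument.

# Expand the hostlist expression into individual node names
for node in $(scontrol show hostnames "$1"); do
    # ASSUMPTION: start_instance is a hypothetical site-specific helper
    # that boots a cloud instance and prints "hostname address"
    info=$(start_instance "$node")
    hostname=${info%% *}   # first field of the output
    addr=${info#* }        # second field of the output

    # Notify the slurmctld daemon of the node's hostname and address
    scontrol update nodename="$node" nodeaddr="$addr" nodehostname="$hostname"
done
</pre>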
| |
| <p>The <i>SuspendProgram</i> only needs to relinquish the node back to the |
| cloud.</p> |
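
<p>A corresponding minimal <i>SuspendProgram</i> sketch, again assuming a
hypothetical site-specific "stop_instance" helper that terminates the
instance associated with a given node name:</p>

<pre>
#!/bin/bash
# Sample SuspendProgram sketch. SLURM invokes this program with the names
# of the nodes to be relinquished (e.g. "ec[0-3]") as its argument.
for node in $(scontrol show hostnames "$1"); do
    # ASSUMPTION: stop_instance is a hypothetical site-specific helper
    # that relinquishes the node's instance back to the cloud
    stop_instance "$node"
done
</pre>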
| |
| <p>An environment variable SLURM_NODE_ALIASES contains sets of node name, |
| communication address and hostname. |
| The variable is set by salloc, sbatch, and srun. |
| It is then used by srun to determine the destination for job launch |
| communication messages. |
| This environment variable is only set for nodes allocated from the cloud. |
| If a job is allocated some resources from the local cluster and others from |
| the cloud, only those nodes from the cloud will appear in SLURM_NODE_ALIASES. |
| Each set of names and addresses is comma separated and |
| the elements within the set are separated by colons. For example:<br> |
SLURM_NODE_ALIASES=ec0:123.45.67.8:foo,ec2:123.45.67.9:bar</p>
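
<p>A short sketch of how a script might parse SLURM_NODE_ALIASES into its
components, relying only on the format described above:</p>

<pre>
#!/bin/bash
# Split SLURM_NODE_ALIASES on commas into sets, then split each set on
# colons into node name, communication address and hostname.
for alias in $(echo "$SLURM_NODE_ALIASES" | tr ',' ' '); do
    name=$(echo "$alias" | cut -d: -f1)
    addr=$(echo "$alias" | cut -d: -f2)
    host=$(echo "$alias" | cut -d: -f3)
    echo "node=$name address=$addr hostname=$host"
done
</pre>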
| |
| <h2>Remaining Work</h2> |
| |
| <ul> |
| <li>We need scripts to provision resources from EC2.</li> |
<li>The SLURM_NODE_ALIASES environment variable needs to change if a job
expands (adds resources).</li>
| <li>Some MPI implementations will not work due to the node naming.</li> |
| <li>Some tests in SLURM's test suite fail.</li> |
| </ul> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p style="text-align:center;">Last modified 15 May 2012</p> |
| |
| <!--#include virtual="footer.txt"--> |