| <!--#include virtual="header.txt"--> |
| |
| <h1>Topology</h1> |
| |
| <p>SLURM version 2.0 can be configured to support topology-aware resource |
| allocation to optimize job performance. |
| There are two primary modes of operation, one to optimize performance on |
| systems with a three-dimensional torus interconnect and another for |
| a hierarchical interconnect.</p> |
| |
| <p>SLURM's native mode of resource selection is to consider the nodes |
| as a one-dimensional array. |
| Jobs are allocated resources on a best-fit basis. |
| For larger jobs, this minimizes the number of sets of consecutive nodes |
| allocated to the job.</p> |
| |
| <h2>Three-dimension Topology</h2> |
| |
| <p>Some larger computers rely upon a three-dimensional torus interconnect. |
| The IBM BlueGene computers is one example of this which has highly |
| constrained resource allocation scheme, essentially requiring that |
| jobs be allocated a set of nodes logically having a rectangular shape. |
| SLURM has a plugin specifically written for BlueGene to select appropriate |
| nodes for jobs, change network switch routing, boot nodes, etc as described |
| in the <a href="bluegene.html">BlueGene User and Administrator Guide</a>.</p> |
| |
| <p>The Sun Constellation and Cray XT systems also have three-dimensional |
| torus interconnects, but do not require that jobs execute in adjacent nodes. |
| On those systems, SLURM only needs to allocate resources to a job which |
| are nearby on the network. |
| SLURM accomplishes this using a |
| <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a> |
| to map the nodes from a three-dimensional space into a one-dimensional |
| space. |
| SLURM's native best-fit algorithm is thus able to achieve a high degree |
| of locality for jobs. |
| For more information, see SLURM's documentation for |
| <a href="sun_const.html">Sun Constellation</a> and |
| <a href="cray.html">Cray XT</a> systems.</p> |
| |
| <h2>Hierarchical Networks</h2> |
| |
| <p>SLURM can also be configured to allocate resources to jobs on a |
| hierarchical network to minimize network contention. |
| The basic algorithm is to identify the lowest level switch in the |
| hierarchy that can satisfy a job's request and then allocate resources |
| on its underlying leaf switches using a best-fit algorithm. |
| Use of this logic requires a configuration setting of |
| <i>TopologyPlugin=topology/tree</i>.</p> |
| |
| <p>At some point in the future SLURM code may be provided to |
| gather network topology information directly. |
| Now the network topology information must be included |
| in a <i>topology.conf</i> configuration file as shown in the |
| examples below. |
| The first example describes a three level switch in which |
| each switch has two children. |
| Note that the <i>SwitchName</i> values are arbitrary and only |
| used to bookkeeping purposes, but a name must be specified on |
| each line. |
| The leaf switch descriptions contain a <i>SwitchName</i> field |
| plus a <i>Nodes</i> field to identify the nodes connected to the |
| switch. |
| Higher-level switch descriptions contain a <i>SwitchName</i> field |
| plus a <i>Switches</i> field to identify the child switches. |
| SLURM's hostlist expression parser is used, so the node and switch |
| names need not be consecutive (e.g. "Nodes=tux[0-3,12,18-20]" |
| and "Swithces=s[0-2,4-8,12]" will parse fine). |
| </p> |
| |
| <p>An optional LinkSpeed option can be used to indicate the |
| relative performance of the link. |
| The units used are arbitrary and this information is currently not used. |
| It may be used in the future to optimize resource allocations.</p> |
| |
| <p>The first example shows what a topology would look like for an |
| eight node cluster in which all switches have only two children as |
| shown in the diagram (not a very realistic configuration, but |
| useful for an example).</p> |
| |
| <pre> |
| # topology.conf |
| # Switch Configuration |
| SwitchName=s0 Nodes=tux[0-1] |
| SwitchName=s1 Nodes=tux[2-3] |
| SwitchName=s2 Nodes=tux[4-5] |
| SwitchName=s3 Nodes=tux[6-7] |
| SwitchName=s4 Switches=s[0-1] |
| SwitchName=s5 Switches=s[2-3] |
| SwitchName=s6 Switches=s[4-5] |
| </pre> |
| <img src=topo_ex1.gif width=600> |
| |
| <p>The next example is for a network with two levels and |
| each switch has four connections.</p> |
| <pre> |
| # topology.conf |
| # Switch Configuration |
| SwitchName=s0 Nodes=tux[0-3] LinkSpeed=900 |
| SwitchName=s1 Nodes=tux[4-7] LinkSpeed=900 |
| SwitchName=s2 Nodes=tux[8-11] LinkSpeed=900 |
| SwitchName=s3 Nodes=tux[12-15] LinkSpeed=1800 |
| SwitchName=s4 Switches=s[0-3] LinkSpeed=1800 |
| SwitchName=s5 Switches=s[0-3] LinkSpeed=1800 |
| SwitchName=s6 Switches=s[0-3] LinkSpeed=1800 |
| SwitchName=s7 Switches=s[0-3] LinkSpeed=1800 |
| </pre> |
| <img src=topo_ex2.gif width=600> |
| |
| <p style="text-align:center;">Last modified 24 March 2009</p> |
| |
| <!--#include virtual="footer.txt"--> |