| <!--#include virtual="header.txt"--> |
| |
| <h1>Topology Guide</h1> |
| |
| <p>Slurm can be configured to support topology-aware resource |
| allocation to optimize job performance. |
| Slurm supports several modes of operation, one to optimize performance on |
| systems with a three-dimensional torus interconnect and another for |
| a hierarchical interconnect. |
The hierarchical mode of operation supports both fat-tree and dragonfly networks,
using slightly different algorithms.</p>
| |
| <p>Slurm's native mode of resource selection is to consider the nodes |
| as a one-dimensional array. |
| Jobs are allocated resources on a best-fit basis. |
| For larger jobs, this minimizes the number of sets of consecutive nodes |
| allocated to the job.</p> |
| |
| <h2 id="contents">Contents |
| <a class="slurm_link" href="#topo_3d"></a> |
| </h2> |
| |
| <ul> |
| <li><a href="#topo_3d">Three-dimensional Topology</a></li> |
| <li><a href="#hierarchical">Tree Topology (Hierarchical Networks)</a> |
| <ul> |
| <li><a href="#config_generators">Configuration Generators</a></li> |
| </ul></li> |
| <li><a href="#block">Block Topology</a> |
| <ul> |
| <li><a href="#block-limitations">Limitations</a></li> |
</ul></li>
<li><a href="#user_opts">User Options</a></li>
| <li><a href="#env_vars">Environment Variables</a></li> |
| <li><a href="#multi_topo">Multiple Topologies</a></li> |
| </ul> |
| |
| <h2 id="topo_3d">Three-dimensional Topology |
| <a class="slurm_link" href="#topo_3d"></a> |
| </h2> |
| |
<p>Some larger computers rely upon a three-dimensional torus
interconnect, such as the Cray XT and XE systems.
These systems do not require that jobs execute on adjacent nodes,
so Slurm only needs to allocate resources to a job which
are nearby on the network.
| Slurm accomplishes this using a |
| <a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a> |
| to map the nodes from a three-dimensional space into a one-dimensional |
| space. |
| Slurm's native best-fit algorithm is thus able to achieve a high degree |
| of locality for jobs.</p> |
| |
| <h2 id="hierarchical">Tree Topology (Hierarchical Networks) |
| <a class="slurm_link" href="#hierarchical"></a> |
| </h2> |
| |
| <p>Slurm can also be configured to allocate resources to jobs on a |
| hierarchical network to minimize network contention. |
| The basic algorithm is to identify the lowest level switch in the |
| hierarchy that can satisfy a job's request and then allocate resources |
| on its underlying leaf switches using a best-fit algorithm. |
| Use of this logic requires a configuration setting of |
| <i>TopologyPlugin=topology/tree</i>.</p> |
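
<p>For example, the corresponding line in <i>slurm.conf</i> would be:</p>
<pre>
# slurm.conf excerpt
TopologyPlugin=topology/tree
</pre>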
| |
<p>Note that Slurm uses a best-fit algorithm on the currently
available resources. This may result in an allocation with
more than the optimum number of switches. The user can request
a maximum number of leaf switches for the job, as well as a
maximum time to wait for that number, using the <code>--switches</code>
option with the salloc, sbatch and srun commands. The parameters can
also be changed for pending jobs using the scontrol and squeue commands.</p>
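
<p>For example (the job ID and script name below are illustrative):</p>
<pre>
# Request at most two leaf switches, waiting up to 60 minutes for them
sbatch --switches=2@60 -N 16 my_job.sh

# Adjust the switch count and wait time on the pending job
scontrol update jobid=1234 switches=4@30
</pre>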
| |
<p>At some point in the future, Slurm code may be provided to
gather network topology information directly.
For now, the network topology information must be included
in a <i>topology.conf</i> configuration file, as shown in the
examples below.
The first example describes a three-level switch hierarchy in which
each switch has two children.
| Note that the <i>SwitchName</i> values are arbitrary and only |
| used for bookkeeping purposes, but a name must be specified on |
| each line. |
| The leaf switch descriptions contain a <i>SwitchName</i> field |
| plus a <i>Nodes</i> field to identify the nodes connected to the |
| switch. |
| Higher-level switch descriptions contain a <i>SwitchName</i> field |
| plus a <i>Switches</i> field to identify the child switches. |
| Slurm's hostlist expression parser is used, so the node and switch |
| names need not be consecutive (e.g. "Nodes=tux[0-3,12,18-20]" |
| and "Switches=s[0-2,4-8,12]" will parse fine). |
| </p> |
| |
<p>An optional <i>LinkSpeed</i> parameter can be used to indicate the
relative performance of the link.
The units are arbitrary and this information is currently not used;
it may be used in the future to optimize resource allocations.</p>
| |
<p>The first example shows what a topology would look like for an
eight-node cluster in which all switches have only two children, as
shown in the diagram (not a very realistic configuration, but
useful for an example).</p>
| |
| <pre> |
| # topology.conf |
| # Switch Configuration |
| SwitchName=s0 Nodes=tux[0-1] |
| SwitchName=s1 Nodes=tux[2-3] |
| SwitchName=s2 Nodes=tux[4-5] |
| SwitchName=s3 Nodes=tux[6-7] |
| SwitchName=s4 Switches=s[0-1] |
| SwitchName=s5 Switches=s[2-3] |
| SwitchName=s6 Switches=s[4-5] |
| </pre> |
<img src="topo_ex1.gif" width="600">
| |
<p>The next example is for a two-level network in which
each switch has four connections.</p>
| <pre> |
| # topology.conf |
| # Switch Configuration |
| SwitchName=s0 Nodes=tux[0-3] LinkSpeed=900 |
| SwitchName=s1 Nodes=tux[4-7] LinkSpeed=900 |
| SwitchName=s2 Nodes=tux[8-11] LinkSpeed=900 |
| SwitchName=s3 Nodes=tux[12-15] LinkSpeed=1800 |
| SwitchName=s4 Switches=s[0-3] LinkSpeed=1800 |
| SwitchName=s5 Switches=s[0-3] LinkSpeed=1800 |
| SwitchName=s6 Switches=s[0-3] LinkSpeed=1800 |
| SwitchName=s7 Switches=s[0-3] LinkSpeed=1800 |
| </pre> |
<img src="topo_ex2.gif" width="600">
| |
<p>As a practical matter, listing every switch connection
slows Slurm's scheduling algorithm as it optimizes job placement,
while application performance may gain little from the optimization.
Listing the leaf switches with their nodes plus one top-level switch
should result in good performance for both applications and Slurm.
| The previous example might be configured as follows:</p> |
| <pre> |
| # topology.conf |
| # Switch Configuration |
| SwitchName=s0 Nodes=tux[0-3] |
| SwitchName=s1 Nodes=tux[4-7] |
| SwitchName=s2 Nodes=tux[8-11] |
| SwitchName=s3 Nodes=tux[12-15] |
| SwitchName=s4 Switches=s[0-3] |
| </pre> |
| |
| <p>Note that compute nodes on switches that lack a common parent switch can |
| be used, but no job will span leaf switches without a common parent |
| (unless the TopologyParam=TopoOptional option is used). |
| For example, it is legal to remove the line "SwitchName=s4 Switches=s[0-3]" |
| from the above topology.conf file. |
| In that case, no job will span more than four compute nodes on any single leaf |
| switch. |
| This configuration can be useful if one wants to schedule multiple physical |
| clusters as a single logical cluster under the control of a single slurmctld |
| daemon.</p> |
| |
<p>If you have nodes that are in separate networks and are associated with
unique switches in your <b>topology.conf</b> file, it is possible to get
into a situation where a job is unable to run. If a job requests nodes in
different networks, either by requesting the nodes directly or by
requesting a feature, the job will fail because the requested nodes
cannot communicate with each other. We recommend placing nodes in
separate network segments in disjoint partitions.</p>
| |
<p>For systems with a dragonfly network, configure Slurm with
<i>TopologyPlugin=topology/tree</i> plus <i>TopologyParam=dragonfly</i>.
If a job cannot be placed entirely within a single leaf switch,
it will be spread across as many leaf switches as possible
in order to optimize the job's network bandwidth.</p>
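
<p>For example, the relevant <i>slurm.conf</i> lines would be:</p>
<pre>
# slurm.conf excerpt
TopologyPlugin=topology/tree
TopologyParam=dragonfly
</pre>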
| |
| <p><b>NOTE</b>: When using the <i>topology/tree</i> plugin, Slurm identifies |
| the network switches which provide the best fit for pending jobs. If nodes |
| have a <i>Weight</i> defined, this will override the resource selection based |
| on network topology.</p> |
| |
| <h3 id="config_generators">Configuration Generators |
| <a class="slurm_link" href="#config_generators"></a></h3> |
| |
| <p>The following independently maintained tools may be useful in generating the |
| <b>topology.conf</b> file for certain switch types:</p> |
| |
| <ul> |
| <li>Infiniband switch - <b>slurmibtopology</b><br> |
| <a href="https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology"> |
| https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology</a></li> |
| <li>Omni-Path (OPA) switch - <b>opa2slurm</b><br> |
| <a href="https://gitlab.com/jtfrey/opa2slurm"> |
| https://gitlab.com/jtfrey/opa2slurm</a></li> |
| <li>AWS Elastic Fabric Adapter (EFA) - <b>ec2-topology</b><br> |
| <a href="https://github.com/aws-samples/ec2-topology-aware-for-slurm"> |
| https://github.com/aws-samples/ec2-topology-aware-for-slurm</a></li> |
| </ul> |
| |
| <h2 id="block">Block Topology<a class="slurm_link" href="#block"></a></h2> |
| |
| <p>Slurm can be configured to allocate resources to jobs within a strictly |
| enforced, hierarchical block structure using |
| <b>TopologyPlugin=topology/block</b>. The block topology prioritizes the |
| placement of jobs to minimize fragmentation across the cluster, as opposed to |
| the tree topology, which focuses on fitting jobs on the first available |
| resources. Small jobs will still be able to use the available space in a block |
| that is partially used.</p> |
| |
| <p>The block topology approach begins with "base blocks" (bblocks), which are |
| fundamental, contiguous groups of nodes defined in |
| <a href="topology.conf.html">topology.conf</a>. |
| These base blocks can be combined with other adjacent base blocks to form |
| "aggregated blocks". In turn, these higher-level blocks can be aggregated |
| with other contiguous blocks of the same hierarchical level to construct |
| progressively larger blocks. This hierarchical arrangement is designed to |
| ensure optimized communication performance for jobs running within these blocks. |
| The <b>BlockSizes</b> configuration parameter defines the specific, enforceable |
| block sizes at each level of this hierarchy.</p> |
| |
| <p>The allocation algorithm operates as follows:</p> |
| |
| <ol> |
| <li>Identify the smallest block level, as defined by <b>BlockSizes</b>, that can |
| satisfy the job's resource request</li> |
| <li>Select a suitable subset of "lower-level blocks" (llblocks) that are |
| components of this chosen aggregating block</li> |
| <li>Allocate resources from the underlying base blocks that constitute this |
| selected subset of llblocks, employing a best-fit algorithm for the |
| precise placement of the job.</li> |
| </ol> |
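
<p>As a sketch of how this fits together (block and node names and the
sizes are illustrative, using the <i>BlockName</i> and <i>BlockSizes</i>
syntax described in <a href="topology.conf.html">topology.conf</a>), the
following configuration defines four base blocks of eight nodes with
enforceable block sizes of 8, 16 and 32. A 12-node job would be satisfied
at the 16-node level, so it would be placed on a pair of adjacent base
blocks using the best-fit step above:</p>
<pre>
# topology.conf (block topology; names and sizes illustrative)
BlockName=b1 Nodes=tux[0-7]
BlockName=b2 Nodes=tux[8-15]
BlockName=b3 Nodes=tux[16-23]
BlockName=b4 Nodes=tux[24-31]
BlockSizes=8,16,32
</pre>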
| |
| <h3 id="block-limitations">Limitations |
| <a class="slurm_link" href="#block-limitations"></a> |
| </h3> |
| |
| <p>Since the block topology takes a different approach than the traditional tree |
| topology, there are limitations that should be taken into consideration.</p> |
| |
| <ul> |
| <li><b>Ranges of nodes</b><br> |
| When using <code>-N</code>/<code>--nodes</code> to specify a range of acceptable |
| node counts, the scheduler will have to evaluate each value of that range to |
| find optimal placement on the available block(s). If using a range is necessary, |
| the number of possible values should be kept as small as possible.</li> |
<li><b>Requesting specific nodes</b><br>
Using <code>-w</code>/<code>--nodelist</code> to request a specific node or
nodes can conflict with the block placement and is not currently supported. You
can use <code>-x</code>/<code>--exclude</code> to prevent a job from
being scheduled on certain nodes, as shown in the example after this list.
</li>
| <li><b>Contiguous blocks</b><br> |
| The scheduler will attempt to place jobs on blocks that are adjacent to each |
| other in the block structure. You cannot currently request that a job be |
| placed on non-adjacent blocks.</li> |
| </ul> |
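
<p>For example (node names are illustrative), instead of requesting
specific nodes, exclude the nodes the job must avoid:</p>
<pre>
# Not currently supported with topology/block:
#   sbatch -w tux[0-3] -N 4 program

# Supported: exclude unwanted nodes and let Slurm choose the block
sbatch -x tux[4-7] -N 4 program
</pre>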
| |
| <h2 id="user_opts">User Options<a class="slurm_link" href="#user_opts"></a></h2> |
| |
<p>For use with the <b>topology/tree</b> plugin, users can also specify the
maximum number of leaf switches to be used for their job, along with the
maximum time the job should wait for this optimized configuration. The
syntax for this option is <code>--switches=count[@time]</code>.
| The system administrator can limit the maximum time that any job can |
| wait for this optimized configuration using the <b>SchedulerParameters</b> |
| configuration parameter with the |
| <a href="slurm.conf.html#OPT_max_switch_wait=#">max_switch_wait</a> option.</p> |
| |
| <p>When <b>topology/tree</b> or <b>topology/block</b> is configured, hostlist |
| functions may be used in place of or alongside regular hostlist expressions |
| in commands or configuration files that interact with the slurmctld. Valid |
| topology functions include:</p> |
| |
| <ul> |
| <li><b>block{blockX}</b> and <b>switch{switchY}</b> - expand to all nodes in |
| the specified block/switch.</li> |
| <li><b>blockwith{nodeX}</b> and <b>switchwith{nodeY}</b> - expand to all nodes |
| in the same block/switch as the specified node.</li> |
| </ul> |
| |
| <p>For example:</p> |
| |
| <pre> |
| scontrol update node=block{b1} state=resume |
| sbatch --nodelist=blockwith{node0} -N 10 program |
| PartitionName=Block10 Nodes=block{block10} ... |
| </pre> |
| |
<p>See also the hostlist function <b>feature{myfeature}</b>, described
<a href="slurm.conf.html#OPT_Features">here</a>.</p>
| |
| <h2 id="env_vars">Environment Variables |
| <a class="slurm_link" href="#env_vars"></a> |
| </h2> |
| |
| <p>If the topology/tree plugin is used, two environment variables will be set |
| to describe that job's network topology. Note that these environment variables |
| will contain different data for the tasks launched on each node. Use of these |
| environment variables is at the discretion of the user.</p> |
| |
<p><b>SLURM_TOPOLOGY_ADDR</b>:
The value will be set to the names of the network switches which may be
involved in the job's communications, from the system's top-level switch
down to the leaf switch, ending with the node name. A period is used to
separate each hardware component name.</p>
| |
<p><b>SLURM_TOPOLOGY_ADDR_PATTERN</b>:
This is set only if the system has the topology/tree plugin configured.
The value will be set to the component types listed in SLURM_TOPOLOGY_ADDR.
Each component will be identified as either "switch" or "node".
A period is used to separate each hardware component type.</p>
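
<p>For example, with the eight-node topology shown earlier, a task
launched on tux1 might see:</p>
<pre>
SLURM_TOPOLOGY_ADDR=s6.s4.s0.tux1
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
</pre>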
| |
| |
| <h2 id="multi_topo">Multiple Topologies |
| <a class="slurm_link" href="#multi_topo"></a> |
| </h2> |
| |
| <p>Slurm 25.05 introduced the ability to define multiple network topologies using the |
| <a href="topology.yaml.html">topology.yaml</a> configuration file. |
| |
| Each partition can be configured to use a specific topology by specifying the |
| <a href="slurm.conf.html#OPT_Topology_1">Topology</a> |
| in its partition configuration line. |
| The Slurm controller will use the selected topology to optimize resource |
| allocation for jobs submitted to that partition. |
If no topology is explicitly specified for a partition,
Slurm will default to the <i>cluster_default</i> topology.</p>
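
<p>For example, a partition definition in <i>slurm.conf</i> might select a
topology defined in <i>topology.yaml</i> (the partition, node and topology
names are illustrative):</p>
<pre>
# slurm.conf excerpt
PartitionName=gpu Nodes=tux[0-15] Topology=block_topo
</pre>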
| |
| <p style="text-align:center;">Last modified 31 July 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |