<!--#include virtual="header.txt"-->
<h1>Topology Guide</h1>
<p>Slurm can be configured to support topology-aware resource
allocation to optimize job performance.
Slurm supports several modes of operation, one to optimize performance on
systems with a three-dimensional torus interconnect and another for
a hierarchical interconnect.
The hierarchical mode of operation supports both fat-tree and dragonfly networks,
using slightly different algorithms.</p>
<p>Slurm's native mode of resource selection is to consider the nodes
as a one-dimensional array.
Jobs are allocated resources on a best-fit basis.
For larger jobs, this minimizes the number of sets of consecutive nodes
allocated to the job.</p>
<h2 id="contents">Contents
<a class="slurm_link" href="#topo_3d"></a>
</h2>
<ul>
<li><a href="#topo_3d">Three-dimensional Topology</a></li>
<li><a href="#hierarchical">Tree Topology (Hierarchical Networks)</a>
<ul>
<li><a href="#config_generators">Configuration Generators</a></li>
</ul></li>
<li><a href="#block">Block Topology</a>
<ul>
<li><a href="#block-limitations">Limitations</a></li>
</ul></li>
<li><a href="#user_opts">User Options</a></li>
<li><a href="#env_vars">Environment Variables</a></li>
<li><a href="#multi_topo">Multiple Topologies</a></li>
</ul>
<h2 id="topo_3d">Three-dimensional Topology
<a class="slurm_link" href="#topo_3d"></a>
</h2>
<p>Some larger computers rely upon a three-dimensional torus interconnect.
Cray XT and XE systems have three-dimensional
torus interconnects, but do not require that jobs execute on adjacent nodes.
On those systems, Slurm only needs to allocate each job resources
that are close together on the network.
Slurm accomplishes this using a
<a href="http://en.wikipedia.org/wiki/Hilbert_curve">Hilbert curve</a>
to map the nodes from a three-dimensional space into a one-dimensional
space.
Slurm's native best-fit algorithm is thus able to achieve a high degree
of locality for jobs.</p>
<h2 id="hierarchical">Tree Topology (Hierarchical Networks)
<a class="slurm_link" href="#hierarchical"></a>
</h2>
<p>Slurm can also be configured to allocate resources to jobs on a
hierarchical network to minimize network contention.
The basic algorithm is to identify the lowest level switch in the
hierarchy that can satisfy a job's request and then allocate resources
on its underlying leaf switches using a best-fit algorithm.
Use of this logic requires a configuration setting of
<i>TopologyPlugin=topology/tree</i>.</p>
<p>Note that Slurm uses a best-fit algorithm on the currently
available resources. This may result in an allocation with
more than the optimum number of switches. The user can request
a maximum number of leaf switches for the job, as well as a
maximum time to wait for that number, using the <code>--switches</code>
option with the salloc, sbatch and srun commands. These parameters can
also be viewed and changed for pending jobs using the squeue and scontrol
commands.</p>
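<p>For example, a job might ask for at most one leaf switch, waiting up to 60
minutes for such a placement, and that request can later be relaxed while the
job is still pending (a sketch; the script name and job ID are hypothetical):</p>
<pre>
# Request at most one leaf switch; wait up to 60 minutes for that placement
sbatch --switches=1@60 -N 8 my_job.sh

# Relax the request on a pending job to allow two leaf switches
scontrol update JobId=1234 Switches=2
</pre>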
<p>At some point in the future, Slurm code may be provided to
gather network topology information directly.
For now, the network topology information must be included
in a <i>topology.conf</i> configuration file, as shown in the
examples below.
The first example describes a three-level switch hierarchy in which
each switch has two children.
Note that the <i>SwitchName</i> values are arbitrary and only
used for bookkeeping purposes, but a name must be specified on
each line.
The leaf switch descriptions contain a <i>SwitchName</i> field
plus a <i>Nodes</i> field to identify the nodes connected to the
switch.
Higher-level switch descriptions contain a <i>SwitchName</i> field
plus a <i>Switches</i> field to identify the child switches.
Slurm's hostlist expression parser is used, so the node and switch
names need not be consecutive (e.g. "Nodes=tux[0-3,12,18-20]"
and "Switches=s[0-2,4-8,12]" will parse fine).
</p>
<p>An optional <i>LinkSpeed</i> value can be used to indicate the
relative performance of the link.
The units used are arbitrary and this information is not currently used.
It may be used in the future to optimize resource allocations.</p>
<p>The first example shows what a topology would look like for an
eight-node cluster in which every switch has only two children, as
shown in the diagram (not a very realistic configuration, but
useful as an example).</p>
<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-1]
SwitchName=s1 Nodes=tux[2-3]
SwitchName=s2 Nodes=tux[4-5]
SwitchName=s3 Nodes=tux[6-7]
SwitchName=s4 Switches=s[0-1]
SwitchName=s5 Switches=s[2-3]
SwitchName=s6 Switches=s[4-5]
</pre>
<img src=topo_ex1.gif width=600>
<p>The next example is for a two-level network in which
each switch has four connections.</p>
<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3] LinkSpeed=900
SwitchName=s1 Nodes=tux[4-7] LinkSpeed=900
SwitchName=s2 Nodes=tux[8-11] LinkSpeed=900
SwitchName=s3 Nodes=tux[12-15] LinkSpeed=1800
SwitchName=s4 Switches=s[0-3] LinkSpeed=1800
SwitchName=s5 Switches=s[0-3] LinkSpeed=1800
SwitchName=s6 Switches=s[0-3] LinkSpeed=1800
SwitchName=s7 Switches=s[0-3] LinkSpeed=1800
</pre>
<img src=topo_ex2.gif width=600>
<p>As a practical matter, listing every switch connection
results in a slower scheduling algorithm as Slurm
optimizes job placement, while application performance may gain
little benefit from that extra detail.
Listing the leaf switches with their nodes plus one top-level switch
should result in good performance for both applications and Slurm.
The previous example might be configured as follows:</p>
<pre>
# topology.conf
# Switch Configuration
SwitchName=s0 Nodes=tux[0-3]
SwitchName=s1 Nodes=tux[4-7]
SwitchName=s2 Nodes=tux[8-11]
SwitchName=s3 Nodes=tux[12-15]
SwitchName=s4 Switches=s[0-3]
</pre>
<p>Note that compute nodes on switches that lack a common parent switch can
be used, but no job will span leaf switches without a common parent
(unless the TopologyParam=TopoOptional option is used).
For example, it is legal to remove the line "SwitchName=s4 Switches=s[0-3]"
from the above topology.conf file.
In that case, no job will span more than four compute nodes on any single leaf
switch.
This configuration can be useful if one wants to schedule multiple physical
clusters as a single logical cluster under the control of a single slurmctld
daemon.</p>
<p>If you have nodes that are in separate networks and are associated with
unique switches in your <b>topology.conf</b> file, it is possible to get
into a situation where a job is unable to run. If a job requests
nodes that are in different networks, either by requesting the nodes
directly or by requesting a feature, the job will fail because the requested
nodes cannot communicate with each other. We recommend placing nodes in
separate network segments in disjoint partitions.</p>
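<p>As a sketch of that recommendation (the node and partition names below are
hypothetical), nodes on two separate networks might be kept in their own
partitions:</p>
<pre>
# slurm.conf (sketch): nodes on separate networks kept in disjoint partitions
PartitionName=netA Nodes=tux[0-7]  Default=YES
PartitionName=netB Nodes=tux[8-15]
</pre>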
<p>For systems with a dragonfly network, configure Slurm with
<i>TopologyPlugin=topology/tree</i> plus <i>TopologyParam=dragonfly</i>.
If a single job cannot be placed entirely within a single network leaf
switch, the job will be spread across as many leaf switches as possible
in order to optimize the job's network bandwidth.</p>
<p><b>NOTE</b>: When using the <i>topology/tree</i> plugin, Slurm identifies
the network switches which provide the best fit for pending jobs. If nodes
have a <i>Weight</i> defined, this will override the resource selection based
on network topology.</p>
<h3 id="config_generators">Configuration Generators
<a class="slurm_link" href="#config_generators"></a></h3>
<p>The following independently maintained tools may be useful in generating the
<b>topology.conf</b> file for certain switch types:</p>
<ul>
<li>Infiniband switch - <b>slurmibtopology</b><br>
<a href="https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology">
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology</a></li>
<li>Omni-Path (OPA) switch - <b>opa2slurm</b><br>
<a href="https://gitlab.com/jtfrey/opa2slurm">
https://gitlab.com/jtfrey/opa2slurm</a></li>
<li>AWS Elastic Fabric Adapter (EFA) - <b>ec2-topology</b><br>
<a href="https://github.com/aws-samples/ec2-topology-aware-for-slurm">
https://github.com/aws-samples/ec2-topology-aware-for-slurm</a></li>
</ul>
<h2 id="block">Block Topology<a class="slurm_link" href="#block"></a></h2>
<p>Slurm can be configured to allocate resources to jobs within a strictly
enforced, hierarchical block structure using
<b>TopologyPlugin=topology/block</b>. The block topology prioritizes the
placement of jobs to minimize fragmentation across the cluster, as opposed to
the tree topology, which focuses on fitting jobs on the first available
resources. Small jobs will still be able to use the available space in a block
that is partially used.</p>
<p>The block topology approach begins with "base blocks" (bblocks), which are
fundamental, contiguous groups of nodes defined in
<a href="topology.conf.html">topology.conf</a>.
These base blocks can be combined with other adjacent base blocks to form
"aggregated blocks". In turn, these higher-level blocks can be aggregated
with other contiguous blocks of the same hierarchical level to construct
progressively larger blocks. This hierarchical arrangement is designed to
ensure optimized communication performance for jobs running within these blocks.
The <b>BlockSizes</b> configuration parameter defines the specific, enforceable
block sizes at each level of this hierarchy.</p>
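<p>As a minimal sketch (the node names, block names and sizes below are only
illustrative), a <i>topology.conf</i> for <b>topology/block</b> with four base
blocks of four nodes each might look like this:</p>
<pre>
# topology.conf (sketch)
# Base blocks: contiguous groups of nodes
BlockName=b1 Nodes=tux[0-3]
BlockName=b2 Nodes=tux[4-7]
BlockName=b3 Nodes=tux[8-11]
BlockName=b4 Nodes=tux[12-15]
# Enforceable block sizes: base blocks of 4 nodes, aggregated blocks of 16
BlockSizes=4,16
</pre>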
<p>The allocation algorithm operates as follows:</p>
<ol>
<li>Identify the smallest block level, as defined by <b>BlockSizes</b>, that can
satisfy the job's resource request</li>
<li>Select a suitable subset of "lower-level blocks" (llblocks) that are
components of this chosen aggregating block</li>
<li>Allocate resources from the underlying base blocks that constitute this
selected subset of llblocks, employing a best-fit algorithm for the
precise placement of the job.</li>
</ol>
<h3 id="block-limitations">Limitations
<a class="slurm_link" href="#block-limitations"></a>
</h3>
<p>Since the block topology takes a different approach than the traditional tree
topology, there are limitations that should be taken into consideration.</p>
<ul>
<li><b>Ranges of nodes</b><br>
When using <code>-N</code>/<code>--nodes</code> to specify a range of acceptable
node counts, the scheduler will have to evaluate each value of that range to
find optimal placement on the available block(s). If using a range is necessary,
the number of possible values should be kept as small as possible.</li>
<li><b>Requesting specific nodes</b><br>
Using <code>-w</code>/<code>--nodelist</code> to request a specific node or
nodes can conflict with the block placement and is not currently supported. You
can use <code>-x</code>/<code>--exclude</code> to prevent a job from
being scheduled on certain nodes.
</li>
<li><b>Contiguous blocks</b><br>
The scheduler will attempt to place jobs on blocks that are adjacent to each
other in the block structure. You cannot currently request that a job be
placed on non-adjacent blocks.</li>
</ul>
<h2 id="user_opts">User Options<a class="slurm_link" href="#user_opts"></a></h2>
<p>For use with the <b>topology/tree</b> plugin, users can also specify the
maximum number of leaf switches to be used for their job, along with the maximum
time the job should wait for this optimized configuration. The syntax for this option
is <code>--switches=count[@time]</code>.
The system administrator can limit the maximum time that any job can
wait for this optimized configuration using the <b>SchedulerParameters</b>
configuration parameter with the
<a href="slurm.conf.html#OPT_max_switch_wait=#">max_switch_wait</a> option.</p>
<p>When <b>topology/tree</b> or <b>topology/block</b> is configured, hostlist
functions may be used in place of or alongside regular hostlist expressions
in commands or configuration files that interact with the slurmctld. Valid
topology functions include:</p>
<ul>
<li><b>block{blockX}</b> and <b>switch{switchY}</b> - expand to all nodes in
the specified block/switch.</li>
<li><b>blockwith{nodeX}</b> and <b>switchwith{nodeY}</b> - expand to all nodes
in the same block/switch as the specified node.</li>
</ul>
<p>For example:</p>
<pre>
scontrol update node=block{b1} state=resume
sbatch --nodelist=blockwith{node0} -N 10 program
PartitionName=Block10 Nodes=block{block10} ...
</pre>
<p>See also the hostlist function <b>feature{myfeature}</b>
<a href="slurm.conf.html#OPT_Features">here</a>.</p>
<h2 id="env_vars">Environment Variables
<a class="slurm_link" href="#env_vars"></a>
</h2>
<p>If the topology/tree plugin is used, two environment variables will be set
to describe each job's network topology. Note that these environment variables
will contain different data for the tasks launched on each node. Use of these
environment variables is at the discretion of the user.</p>
<p><b>SLURM_TOPOLOGY_ADDR</b>:
The value will be set to the names of the network switches which may be involved in
the job's communications, from the system's top-level switch down to the leaf
switch, ending with the node name. A period is used to separate each hardware
component name.</p>
<p><b>SLURM_TOPOLOGY_ADDR_PATTERN</b>:
This is set only if the system has the topology/tree plugin configured.
The value will be set to the component types listed in SLURM_TOPOLOGY_ADDR.
Each component will be identified as either "switch" or "node".
A period is used to separate each hardware component type.</p>
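<p>For example, with the first eight-node <i>topology.conf</i> example above,
a task launched on tux0 might see values along these lines:</p>
<pre>
SLURM_TOPOLOGY_ADDR=s6.s4.s0.tux0
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.switch.node
</pre>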
<h2 id="multi_topo">Multiple Topologies
<a class="slurm_link" href="#multi_topo"></a>
</h2>
<p>Slurm 25.05 introduced the ability to define multiple network topologies using the
<a href="topology.yaml.html">topology.yaml</a> configuration file.
Each partition can be configured to use a specific topology by specifying the
<a href="slurm.conf.html#OPT_Topology_1">Topology</a>
in its partition configuration line.
The Slurm controller will use the selected topology to optimize resource
allocation for jobs submitted to that partition.
If no topology is explicitly specified for a partition,
Slurm will default to the cluster_default topology.</p>
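<p>For example, a partition might be tied to a named topology as follows (the
partition and topology names are hypothetical):</p>
<pre>
# slurm.conf (sketch)
PartitionName=gpu Nodes=tux[0-7] Topology=gpu_block
</pre>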
<p style="text-align:center;">Last modified 31 July 2025</p>
<!--#include virtual="footer.txt"-->