<!--#include virtual="header.txt"-->
<h1>Hierarchical Resource (HRES) Scheduling</h1>
<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
<p>Hierarchical Resources (HRES) were added in Slurm 25.05 in a <b>Beta</b>
state. While in beta, the configuration file format and other aspects of this
feature may undergo substantial changes in future versions and require major
changes to avoid errors. Sites that utilize this feature while in a beta state
should pay particularly close attention to changes noted in RELEASE_NOTES and
the CHANGELOGs when <a href="upgrades.html">upgrading</a>.</p>
<p>Hierarchical Resources allow for <a href="licenses.html">license</a>-like
resources to be defined as part of independent hierarchical topologies and
associated with specific nodes. Jobs may request any integer count of that
resource. Multiple resources may be defined in the configuration (in a
<a href="resources.yaml.html">resources.yaml</a> file), and will use
independently defined hierarchies.</p>
<p>Although these hierarchical topologies have some similarities to
<a href="topology.html">network topologies</a>, the definitions for each are
completely separate.</p>
<h2 id="modes">Planning Modes<a class="slurm_link" href="#modes"></a></h2>
<p>Three modes of resource planning are provided. In every mode, a layer may be
defined with a count of zero or infinite (<code>count: -1</code>) resources;
the effect of doing so depends on the mode used.</p>
<h3 id="mode1">Mode 1<a class="slurm_link" href="#mode1"></a></h3>
<p>In <b>Mode 1</b>, sufficient resources are required on only one layer that
overlaps with the job allocation. In many cases only a single level is needed.
However, multiple levels may be defined if some additional flexibility in where
resources are used is preferred, as described in the example below.</p>
<p>A layer with a <b>zero</b> count of a resource will have no impact on
scheduling. Since that layer will never satisfy the request, it will never alter
the calculated list of nodes eligible to run a given job. It would be
semantically equivalent to removing that layer definition.</p>
<p>A layer with an <b>infinite</b> count of a resource will always allow a
new job allocation to succeed. Unless prevented by other requirements, jobs
will be able to execute immediately.</p>
<p>For example, consider the <b>natural</b> resource defined in the example
<a href="#resources_yaml">resources.yaml</a> file below. A job that requests
this resource and is allocated <code>node[01-04]</code> will pull from the pool
of 100 available to <code>node[01-16]</code>. The count of 50 specified at
the higher level (<code>node[01-32]</code>) can be allocated in addition
to the resources specified for each group of 16 nodes. That is, 100 would be
available to each group of 16 nodes, and 50 more would be available on any of
them, for a total of 250 of this resource.</p>
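<p>The Mode 1 rule can be illustrated with a minimal Python sketch. This is
not Slurm code: the tuple layout <code>(first_node, last_node, count)</code>
and the function names are hypothetical, current usage by other jobs is
ignored, and the sketch assumes that a single overlapping layer must cover the
whole request, as the description above suggests.</p>

```python
# Hypothetical model of the "natural" resource from the example resources.yaml.
# Each layer is (first_node, last_node, count); this layout is an assumption
# made for illustration and is not a Slurm data structure.
NATURAL_LAYERS = [
    (1, 32, 50),    # highest level: node[01-32]
    (1, 16, 100),   # lowest levels: node[01-16]
    (17, 32, 100),  #                node[17-32]
]


def overlaps(layer, job_nodes):
    """Return True if the layer shares at least one node with the job."""
    first, last, _ = layer
    return any(first <= n <= last for n in job_nodes)


def mode1_fits(layers, job_nodes, request):
    """Mode 1 as described above: one overlapping layer with enough suffices.

    A count of -1 is treated as infinite. Usage by other jobs is ignored.
    """
    for layer in layers:
        count = layer[2]
        if overlaps(layer, job_nodes) and (count == -1 or count >= request):
            return True
    return False


job = range(1, 5)  # node[01-04]
print(mode1_fits(NATURAL_LAYERS, job, 100))          # True (node[01-16] pool)
print(mode1_fits(NATURAL_LAYERS, job, 101))          # False in this sketch
print(sum(count for _, _, count in NATURAL_LAYERS))  # 250 in total
```

<p>In this model a zero-count layer never satisfies the check and an infinite
layer always does, which matches the behavior described for Mode 1.</p>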
<h3 id="mode2">Mode 2<a class="slurm_link" href="#mode2"></a></h3>
<p>In <b>Mode 2</b>, sufficient resources must be available on every layer, at
every level, that overlaps with the job allocation. Note that scheduling stalls
are possible if higher levels do not make enough resources available.</p>
<p>A layer with a <b>zero</b> count of a resource will have a significant impact
on scheduling. All nodes underneath that layer would never be able to satisfy
the allocation constraints. This could be used to mark that resource unavailable
for a specific portion of the cluster, e.g., for hardware maintenance, without
altering the individual lower-layer resource counts.</p>
<p>A layer with an <b>infinite</b> count of a resource will have no impact on
scheduling. Since that layer will always satisfy the request, it will never
constrain execution and would be semantically equivalent to removing that layer
definition.</p>
<p>For example, consider the <b>flat</b> resource defined in the example
<a href="#resources_yaml">resources.yaml</a> file below. A job that requests
this resource and is allocated <code>node[01-04]</code> will pull from all
levels that include those nodes. So it would pull from the 12 available on
<code>node[01-08]</code>, from the 16 available on <code>node[01-16]</code>, and
from the 24 available on <code>node[01-32]</code>. Since the highest level only
contains 24, that is the maximum that can be allocated, even though a larger
total is available on lower levels. This example allows a fixed set of resources
to have some flexibility in where they are used.</p>
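<p>As a rough illustration of the Mode 2 rule, the following Python sketch
computes the largest request that every overlapping layer could satisfy. It is
not Slurm code: the <code>(first_node, last_node, count)</code> tuples and the
function name are hypothetical, and usage by other jobs is ignored.</p>

```python
# Hypothetical model of the "flat" resource from the example resources.yaml.
# Each layer is (first_node, last_node, count); the layout is an assumption.
FLAT_LAYERS = [
    (1, 32, 24),                                         # highest level
    (1, 16, 16), (17, 32, 16),                           # middle levels
    (1, 8, 12), (9, 16, 12), (17, 24, 12), (25, 32, 12), # lowest levels
]


def mode2_max(layers, job_nodes):
    """Largest request that all overlapping layers can satisfy.

    Layers with count -1 (infinite) never constrain the result.
    Returns None if no finite layer overlaps the job.
    """
    limits = [
        count
        for first, last, count in layers
        if count != -1 and any(first <= n <= last for n in job_nodes)
    ]
    return min(limits) if limits else None


# A job on node[01-04] overlaps the layers with counts 24, 16 and 12.
print(mode2_max(FLAT_LAYERS, range(1, 5)))  # 12

# A zero-count layer blocks every node underneath it.
print(mode2_max([(1, 32, -1), (1, 16, 0)], range(1, 5)))  # 0
```

<p>The tightest overlapping layer sets the limit for a single job, while the
count of 24 on <code>node[01-32]</code> caps the total in use across the whole
cluster.</p>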
<h3 id="mode3">Mode 3<a class="slurm_link" href="#mode3"></a></h3>
<p>In <b>Mode 3</b>, resources are allocated on a per-node basis,
with the total consumption being calculated from the most granular layer and
summed up into the successively higher layers.
This mode is specifically designed for consumable resources like power,
where the total usage at a higher level (e.g., cluster-wide)
is the sum of the usage of its components (e.g., chassis level,
then rack level summing multiple chassis, then datacenter row).</p>
<p>The scheduler ensures that the total summed resources at every layer
in the hierarchy do not exceed that layer's defined count.</p>
<p><b>Important Note:</b> Layers defined under <b>Mode 3</b> must be able to be
represented as a <b>rooted uniform depth tree</b>.</p>
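<p>One possible reading of this requirement can be checked with a short Python
sketch. The interpretation is an assumption, not taken from the Slurm source:
layers are treated as node sets that must be either nested or disjoint, exactly
one layer must contain all the others, and every lowest-level layer must sit at
the same depth. The function name and the <code>(first_node, last_node)</code>
tuples are hypothetical.</p>

```python
def is_rooted_uniform_depth_tree(layers):
    """Check an assumed reading of 'rooted uniform depth tree'.

    layers: list of (first_node, last_node) inclusive ranges.
    """
    sets = [frozenset(range(first, last + 1)) for first, last in layers]
    if len(set(sets)) != len(sets):
        return False  # duplicate layers
    # Any two layers must be nested or disjoint (no partial overlap).
    for a in sets:
        for b in sets:
            if a & b and not (a <= b or b <= a):
                return False
    # Depth of a layer is the number of layers that strictly contain it.
    depth = {s: sum(1 for other in sets if s < other) for s in sets}
    roots = [s for s in sets if depth[s] == 0]
    # Leaves are layers that contain no other layer.
    leaves = [s for s in sets if not any(other < s for other in sets)]
    return len(roots) == 1 and len({depth[s] for s in leaves}) == 1


# Node ranges of the "power" layers in the example resources.yaml.
power = [(1, 8), (17, 24), (9, 16), (25, 32), (1, 16), (17, 32), (1, 32)]
print(is_rooted_uniform_depth_tree(power))  # True

# Dropping node[17-32] leaves two leaves one level shallower than the others.
uneven = [(1, 8), (17, 24), (9, 16), (25, 32), (1, 16), (1, 32)]
print(is_rooted_uniform_depth_tree(uneven))  # False
```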
<p>For example, consider the <b>power</b> resource defined in the example
<a href="#resources_yaml">resources.yaml</a> file below. A job that requests
1000 of the <b>power</b> resource and is allocated <code>node[01-16]</code>
will pull from all levels that include those nodes. Because the request is
counted per node, it would pull 8000 from <code>node[01-08]</code> (1000 for
each of its 8 nodes), 8000 from <code>node[09-16]</code>, 16000 from
<code>node[01-16]</code>, and 16000 from <code>node[01-32]</code>.</p>
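<p>The per-node summation can be reproduced with a small Python sketch. It is
an illustration only: the <code>(first_node, last_node, count)</code> tuples
and function names are hypothetical, and the <code>base</code> and
<code>variables</code> entries of the example configuration, as well as usage
by other jobs, are deliberately ignored.</p>

```python
# Hypothetical model of the "power" layers from the example resources.yaml.
# Each layer is (first_node, last_node, count); the layout is an assumption.
POWER_LAYERS = [
    (1, 8, 40000), (17, 24, 40000), (9, 16, 40000), (25, 32, 40000),
    (1, 16, 60000), (17, 32, 80000),
    (1, 32, 130000),
]


def mode3_usage(layers, job_nodes, per_node):
    """Consumption per layer: per_node times the job's nodes in that layer."""
    usage = {}
    for first, last, _ in layers:
        inside = sum(1 for n in job_nodes if first <= n <= last)
        if inside:
            usage[(first, last)] = inside * per_node
    return usage


def mode3_fits(layers, job_nodes, per_node):
    """True if no layer's summed consumption exceeds its count (-1 = infinite)."""
    usage = mode3_usage(layers, job_nodes, per_node)
    return all(
        count == -1 or usage.get((first, last), 0) <= count
        for first, last, count in layers
    )


job = range(1, 17)  # node[01-16]
print(mode3_usage(POWER_LAYERS, job, 1000))
# {(1, 8): 8000, (9, 16): 8000, (1, 16): 16000, (1, 32): 16000}
print(mode3_fits(POWER_LAYERS, job, 1000))  # True
print(mode3_fits(POWER_LAYERS, job, 4000))  # False: 64000 > 60000 on node[01-16]
```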
<h2 id="limitations">Limitations<a class="slurm_link" href="#limitations"></a></h2>
<ul>
<li>The resource counts cannot be dynamically changed; all changes must be made
through <b>resources.yaml</b> and applied with a restart or reconfigure</li>
<li>Dynamic nodes are not supported</li>
<li>This implementation is in a <b>beta</b> state (see
<a href="#overview">overview</a>)</li>
<li>Resource names defined through the hierarchical resources configuration must
not conflict with any cluster licenses</li>
</ul>
<h2 id="examples">Examples<a class="slurm_link" href="#examples"></a></h2>
<p>After defining a resource hierarchy in <b>resources.yaml</b> (see below for
an example), you can interact with these resources in several ways, using the
syntax already used for licenses:</p>
<ul>
<li>View HRES<pre>scontrol show license</pre></li>
<li>Reserve HRES
<pre>scontrol create reservation account=root licenses=natural(node[01-16]):30,flat(node[09-12]):4,flat(node[29-32]):2</pre>
Note that the layer must be specified for each reserved HRES, identified by the
node list available through that layer</li>
<li>Request HRES on a job
<pre>sbatch --licenses=flat:2,natural:1 my_script.sh</pre></li>
</ul>
<h3 id="resources_yaml">resources.yaml<a class="slurm_link" href="#resources_yaml"></a></h3>
<pre>
---
- resource: flat
  mode: MODE_2
  layers:
    - nodes: # highest level
        - "node[01-32]"
      count: 24
    - nodes: # middle levels
        - "node[01-16]"
      count: 16
    - nodes:
        - "node[17-32]"
      count: 16
    - nodes: # lowest levels
        - "node[01-08]"
      count: 12
    - nodes:
        - "node[09-16]"
      count: 12
    - nodes:
        - "node[17-24]"
      count: 12
    - nodes:
        - "node[25-32]"
      count: 12
- resource: natural
  mode: MODE_1
  layers:
    - nodes: # highest level
        - "node[01-32]"
      count: 50
    - nodes: # lowest levels
        - "node[01-16]"
      count: 100
    - nodes:
        - "node[17-32]"
      count: 100
- resource: power
  mode: MODE_3
  topology: blok1
  variables:
    - name: full_node
      value: 1000
    - name: full_gpu_node
      value: 2000
  layers:
    - nodes:
        - "node[01-08]"
      count: 40000
      base:
        - name: storage
          value: 5000
    - nodes:
        - "node[17-24]"
      count: 40000
    - nodes:
        - "node[09-16]"
      count: 40000
    - nodes:
        - "node[25-32]"
      count: 40000
    - nodes:
        - "node[01-16]"
      count: 60000
      base:
        - name: network1
          value: 3000
    - nodes:
        - "node[17-32]"
      count: 80000
      base:
        - name: network2
          value: 2000
    - nodes:
        - "node[01-32]"
      count: 130000
      base:
        - name: acUnit1
          value: 10000
        - name: acUnit2
          value: 8000
</pre>
<!--#include virtual="footer.txt"-->