<!--#include virtual="header.txt"-->
<h1><a name="top">Generic Resource (GRES) Design Guide</a></h1>
<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>
<p>Generic Resources (GRES) are resources associated with a specific node
that can be allocated to jobs and steps. The most obvious example of
GRES use would be GPUs. GRES are identified by a specific name and use an
optional plugin to provide device-specific support. This document is meant
to provide details about Slurm's implementation of GRES support including the
relevant data structures. For an overview of GRES configuration and use, see
<a href="gres.html">Generic Resource (GRES) Scheduling</a>.
<h2 id="data">Data Structures<a class="slurm_link" href="#data"></a></h2>
<p>GRES are associated with Slurm nodes, jobs and job steps. You will find
a string variable named <b>gres</b> in those data structures which
is used to store the GRES configured on a node or required by a job or step
(e.g. "gpu:2,nic:1"). This string is also visible to various Slurm commands
viewing information about those data structures (e.g. "scontrol show job").
There is a second variable associated with each of those data structures on
the <b>slurmctld</b> daemon
named <b>gres_list</b> that is intended for program use only. Each element
in the list <b>gres_list</b> provides information about a specific GRES type
(e.g. one data structure for "gpu" and a second structure with information
about "nic"). The structures on <b>gres_list</b> contain an ID number
(which is faster to compare than a string) plus a pointer to another structure.
This second structure differs somewhat for nodes, jobs, and steps (see
<b>gres_node_state_t</b>, <b>gres_job_state_t</b>, and <b>gres_step_state_t</b> in
<b>src/common/gres.h</b> for details), but contains various counters and bitmaps.
Since these data structures differ for various entity types, the functions
used to work with them are also different. If no GRES are associated with a
node, job or step, then both <b>gres</b> and <b>gres_list</b> will be NULL.</p>
<pre>
------------------------
| Job Information |
|----------------------|
| gres = "gpu:2,nic:1" |
| gres_list |
------------------------
|
+---------------------------------
| |
------------------ ------------------
| List Struct | | List Struct |
|----------------| |----------------|
| id = 123 (gpu) | | id = 124 (nic) |
| gres_data | | gres_data |
------------------ ------------------
| |
| ....
|
|
------------------------------------------------
| gres_job_state_t |
|----------------------------------------------|
| gres_count = 2 |
| node_count = 3 |
| gres_bitmap(by node) = 0,1; |
| 2,3; |
| 0,2 |
| gres_count_allocated_to_steps(by node) = 1; |
| 1; |
| 1 |
| gres_bitmap_allocated_to_steps(by node) = 0; |
| 2; |
| 0 |
------------------------------------------------
</pre>
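<p>The layout in the diagram above can be summarized with a simplified C
sketch. The field names below are illustrative only; the authoritative
definitions in <b>src/common/gres.h</b> contain additional fields.</p>
<pre>
/* Simplified sketch of a gres_list element and the job state it
 * points to; the real definitions in src/common/gres.h contain
 * additional fields. bitstr_t is Slurm's bitmap type from
 * src/common/bitstring.h. */
#include &lt;stdint.h&gt;
#include "src/common/bitstring.h"

typedef struct gres_state {
    uint32_t id;           /* GRES type ID; faster to compare
                            * than the name string */
    void    *gres_data;    /* gres_node_state_t, gres_job_state_t
                            * or gres_step_state_t */
} gres_state_t;

typedef struct gres_job_state {
    uint32_t   gres_count;            /* GRES required per node */
    uint32_t   node_count;            /* nodes in the allocation */
    bitstr_t **gres_bit_alloc;        /* per node: which devices
                                       * are allocated */
    uint32_t  *gres_count_step_alloc; /* per node: GRES count
                                       * allocated to steps */
    bitstr_t **gres_bit_step_alloc;   /* per node: which devices
                                       * are allocated to steps */
} gres_job_state_t;
</pre>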
<h2 id="op">Mode of Operation<a class="slurm_link" href="#op"></a></h2>
<p>After the slurmd daemon reads the configuration files, it calls the function
<b>node_config_load()</b> for each configured plugin. This can be used to
validate the configuration, for example to verify that the appropriate devices
actually exist. If no GRES plugin exists for that resource type, the information
in the configuration file is assumed correct. Each node's GRES information is
reported by slurmd to the slurmctld daemon at node registration time.</p>
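<p>As a sketch of the sort of validation a plugin might perform in
<b>node_config_load()</b>, the fragment below simply checks that a
configured device file exists. The helper function is an illustrative
assumption, not the actual plugin implementation; <b>error()</b> is
Slurm's logging call from <b>src/common/log.h</b>.</p>
<pre>
/* Hypothetical helper for a GRES plugin's node_config_load():
 * verify that a configured device file actually exists. */
#include &lt;sys/stat.h&gt;
#include &lt;errno.h&gt;
#include &lt;string.h&gt;
#include "src/common/log.h"         /* error() */
#include "slurm/slurm_errno.h"      /* SLURM_SUCCESS, SLURM_ERROR */

static int _validate_device_file(const char *path)
{
    struct stat st;

    if (stat(path, &amp;st) &lt; 0) {
        error("GRES device file %s not found: %s",
              path, strerror(errno));
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}
</pre>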
<p>The slurmctld daemon maintains GRES information in the data structures
described above for each node, including the number of configured and allocated
resources. If those resources are identified with a specific device file
rather than just a count, bitmaps are used to record which specific resources have
been allocated to jobs.</p>
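<p>As a minimal illustration using Slurm's <b>bitstr_t</b> bitmap type
(see <b>src/common/bitstring.h</b>), recording that two specific GPUs
out of four configured on a node are allocated might look like:</p>
<pre>
#include "src/common/bitstring.h"

/* Sketch: a node has four GPUs; devices 1 and 3 are allocated */
bitstr_t *gpu_alloc = bit_alloc(4);
bit_set(gpu_alloc, 1);
bit_set(gpu_alloc, 3);
</pre>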
<p>The slurmctld daemon's GRES information about jobs includes several arrays
equal in length to the number of allocated nodes. The index into each of the
arrays is the sequence number of the node in that job's allocation (e.g.
the first element is node zero of the <b>job</b> allocation). The job step's
GRES information is organized the same way, with its array indexes likewise
based upon the job's allocation. This means that when a job step is
allocated or terminates, the required bitmap operations can be performed
without computing different index values for job and step
data structures.</p>
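<p>Because the job and step arrays share the same node sequence numbers, a
step allocation can operate on both with a single index. The sketch below
uses the illustrative field names from the structure sketch above together
with Slurm's <b>bit_test()</b>/<b>bit_set()</b> bitmap calls.</p>
<pre>
/* Sketch: mark one GRES device on one node as allocated to a step.
 * node_inx indexes the job's allocation for both structures, so no
 * index translation is needed. */
static void _step_alloc_gres(gres_job_state_t *job_gres,
                             int node_inx, int gres_inx)
{
    if (!bit_test(job_gres->gres_bit_step_alloc[node_inx], gres_inx)) {
        bit_set(job_gres->gres_bit_step_alloc[node_inx], gres_inx);
        job_gres->gres_count_step_alloc[node_inx]++;
    }
}
</pre>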
<p>The most complex operation on the GRES data structures happens when a job
changes size (has nodes added or removed). In that case, the arrays indexed by
node sequence number must be rebuilt, with records shifted as appropriate.
Note that the current software does not support different GRES counts on
different nodes (a job cannot have 2 GPUs on one node and 1 GPU on a second
node), although that might be addressed at a later time.</p>
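<p>A minimal sketch of that rebuild, again using the illustrative field
names from above: surviving records are shifted down so their positions
match the new node sequence numbers, and records for removed nodes are
freed.</p>
<pre>
/* Sketch: compact per-node arrays after nodes leave a job.
 * keep[i] is true if old node index i remains in the allocation.
 * FREE_NULL_BITMAP() is from src/common/bitstring.h. */
static void _rebuild_node_arrays(gres_job_state_t *job_gres,
                                 const bool *keep, int old_node_count)
{
    int new_inx = 0;

    for (int old_inx = 0; old_inx &lt; old_node_count; old_inx++) {
        if (!keep[old_inx]) {
            FREE_NULL_BITMAP(job_gres->gres_bit_alloc[old_inx]);
            continue;
        }
        /* shift the record to its new position */
        job_gres->gres_bit_alloc[new_inx++] =
            job_gres->gres_bit_alloc[old_inx];
    }
    for (int i = new_inx; i &lt; old_node_count; i++)
        job_gres->gres_bit_alloc[i] = NULL;    /* clear stale aliases */
    job_gres->node_count = new_inx;
    /* the other per-node arrays are compacted the same way */
}
</pre>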
<p>When a job or step is initiated, its credential includes allocated GRES information.
This can be used by the slurmd daemon to associate those resources with that
job. Our plan is to use the Linux cgroups logic to bind a job and/or its
tasks to specific GRES devices; however, that logic does not currently exist.
What does exist today is a pair of plugin APIs, <b>job_set_env()</b> and
<b>step_set_env()</b>, which can be used to set environment variables for the
program, directing it to the GRES which have been allocated for its use.
(The CUDA libraries base their GPU selection upon environment variables, so
this logic should work for CUDA today, provided users do not manipulate the
environment variables reserved for CUDA use.)</p>
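<p>For illustration, a <b>job_set_env()</b> implementation for GPUs might
translate a node's allocated-GPU bitmap into the
<b>CUDA_VISIBLE_DEVICES</b> variable that the CUDA libraries consult. The
helper below is a sketch under that assumption; a real implementation would
modify the job's environment array rather than the daemon's own
environment.</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include "src/common/bitstring.h"

/* Sketch: build "CUDA_VISIBLE_DEVICES=1,3" from a GPU bitmap */
static void _set_cuda_env(bitstr_t *gpu_bitmap)
{
    char buf[256] = "", tmp[16];

    for (int i = 0; i &lt; bit_size(gpu_bitmap); i++) {
        if (!bit_test(gpu_bitmap, i))
            continue;
        snprintf(tmp, sizeof(tmp), "%s%d",
                 buf[0] ? "," : "", i);
        strcat(buf, tmp);
    }
    setenv("CUDA_VISIBLE_DEVICES", buf, 1);
}
</pre>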
<p>To see how the GRES logic allocates resources, configure
<b>DebugFlags=GRES</b> to log GRES state changes. Note that the resulting
output can be quite verbose, especially for larger clusters.</p>
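<p>For example, in <b>slurm.conf</b>:</p>
<pre>
DebugFlags=GRES
</pre>
<p>The flag can also be changed at run time with scontrol's
<b>setdebugflags</b> option.</p>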
<p style="text-align:center;">Last modified 6 August 2021</p>
<!--#include virtual="footer.txt"-->