<!--#include virtual="header.txt"-->
<h1>Generic Resource (GRES) Scheduling</h1>
<h2 id="contents">Contents<a class="slurm_link" href="#contents"></a></h2>
<ul>
<li><a href="#Overview">Overview</a></li>
<li><a href="#Configuration">Configuration</a></li>
<li><a href="#Running_Jobs">Running Jobs</a></li>
<li><a href="#AutoDetect">AutoDetect</a></li>
<li><a href="#Accounting">Accounting</a></li>
<li><a href="#GPU_Management">GPU Management</a></li>
<li><a href="#MPS_Management">MPS Management</a></li>
<li><a href="#MIG_Management">MIG Management</a></li>
<li><a href="#Sharding">Sharding</a></li>
</ul>
<h2 id="Overview">Overview<a class="slurm_link" href="#Overview"></a></h2>
<p>Slurm supports the ability to define and schedule arbitrary Generic RESources
(GRES). Additional built-in features are enabled for specific GRES types,
including Graphics Processing Units (GPUs), CUDA Multi-Process Service (MPS)
devices, and Sharding through an extensible plugin mechanism.</p>
<h2 id="Configuration">Configuration
<a class="slurm_link" href="#Configuration"></a>
</h2>
<P>By default, no GRES are enabled in the cluster's configuration.
You must explicitly specify which GRES are to be managed in the
<I>slurm.conf</I> configuration file. The configuration parameters of
interest are <B>GresTypes</B> and <B>Gres</B>.
</P>
<P>
For more details, see <a href="slurm.conf.html#OPT_GresTypes">GresTypes</a> and <a href="slurm.conf.html#OPT_Gres_1">Gres</a> in the <I>slurm.conf</I> man page.
</P>
<P>Note that the GRES specification for each node works in the same fashion
as the other resources managed. Nodes which are found to have fewer resources
than configured will be placed in a DRAIN state.</P>
<P>Snippet from an example <I>slurm.conf</I> file:</P>
<PRE>
# Configure four GPUs (with MPS), plus bandwidth
GresTypes=gpu,mps,bandwidth
NodeName=tux[0-7] Gres=gpu:tesla:2,gpu:kepler:2,mps:400,bandwidth:lustre:no_consume:4G
</PRE>
<P>Each compute node with generic resources typically contains a <I>gres.conf</I>
file describing which resources are available on the node, their count,
associated device files and cores which should be used with those resources.</P>
<p>There are cases where you may want to define a Generic Resource on a node
without specifying a quantity of that GRES. For example, the filesystem type
of a node doesn't decrease in value as jobs run on that node.
You can use the <b>no_consume</b> flag to allow users to request a GRES
without having a defined count that gets used as it is requested.</p>
<P>
To view available <I>gres.conf</I> configuration parameters, see the
<a href="gres.conf.html">gres.conf man page</a>.</P>
<h2 id="Running_Jobs">Running Jobs
<a class="slurm_link" href="#Running_Jobs"></a>
</h2>
<P>Jobs will not be allocated any generic resources unless specifically
requested at job submit time using the options:</P>
<DL>
<DT><I>--gres</I></DT>
<DD>Generic resources required per node</DD>
<DT><I>--gpus</I></DT>
<DD>GPUs required per job</DD>
<DT><I>--gpus-per-node</I></DT>
<DD>GPUs required per node. Equivalent to the <I>--gres</I> option for GPUs.</DD>
<DT><I>--gpus-per-socket</I></DT>
<DD>GPUs required per socket. Requires the job to specify a task socket.</DD>
<DT><I>--gpus-per-task</I></DT>
<DD>GPUs required per task. Requires the job to specify a task count.</DD>
</DL>
<P>All of these options are supported by the <I>salloc</I>, <I>sbatch</I> and
<I>srun</I> commands.
Note that all of the <I>--gpu*</I> options are only supported by Slurm's
select/cons_tres plugin.
Jobs requesting these options when the select/cons_tres plugin is <U>not</U>
configured will be rejected.
The <I>--gres</I> option requires an argument specifying which generic resources
are required and how many resources using the form <I>name[[:type]:count]</I>
while all of the <I>--gpu*</I> options require an argument of the form
<I>[type:]count</I>.
The <I>name</I> is the same name as
specified by the <I>GresTypes</I> and <I>Gres</I> configuration parameters.
<I>type</I> identifies a specific type of that generic resource (e.g. a
specific model of GPU).
<I>count</I> specifies how many resources are required and has a default
value of 1. For example:<BR>
<I>sbatch --gres=gpu:kepler:2 ...</I>.</P>
<p>Requests for typed vs non-typed generic resources must be consistent
within a job. For example, if you request <i>--gres=gpu:2</i> with
<b>sbatch</b>, you would not be able to request <i>--gres=gpu:tesla:2</i>
with <b>srun</b> to create a job step. The same holds true in reverse,
if you request a typed GPU to create a job allocation, you should request
a GPU of the same type to create a job step.</p>
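<p>The request forms and the consistency rule above can be sketched as a
small pre-submission check. This is an illustration only, not Slurm's own
parser: the helper names <i>gres_is_typed</i> and <i>same_gres_style</i> are
hypothetical, and the sketch assumes that a trailing field made of digits
(with an optional K, M, G or T suffix) is the count and that any other field
after the name is the type.</p>

```shell
#!/bin/sh
# Illustrative helpers only; hypothetical names, not part of Slurm.

# gres_is_typed SPEC: succeed if SPEC names a type (e.g. gpu:tesla:2).
gres_is_typed() {
    rest=${1#*:}                          # drop the leading "name:"
    [ "$rest" != "$1" ] || return 1       # bare name such as "gpu": untyped
    case $rest in *:*) return 0 ;; esac   # name:type:count
    num=${rest%[KMGT]}                    # strip an optional count suffix
    case $num in
        ''|*[!0-9]*) return 0 ;;          # not a number, so it is a type
        *) return 1 ;;                    # a plain count: untyped
    esac
}

# same_gres_style JOB_SPEC STEP_SPEC: succeed if both are typed or both untyped.
same_gres_style() {
    if gres_is_typed "$1"; then
        gres_is_typed "$2"
    else
        ! gres_is_typed "$2"
    fi
}

# The combination from the text: sbatch --gres=gpu:2, then srun --gres=gpu:tesla:2
same_gres_style gpu:2 gpu:tesla:2 ||
    echo "mismatch: the step requests a typed GPU but the job did not"
```

<p>The check only compares typed against non-typed requests. It does not
verify that a typed step request names the same type as the job.</p>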
<P>Several additional resource requirement specifications are available
specifically for GPUs and detailed descriptions about these options are
available in the man pages for the job submission commands.
As with the <I>--gpu*</I> options, these options are only supported by Slurm's
select/cons_tres plugin.</P>
<DL>
<DT><I>--cpus-per-gpu</I></DT>
<DD>Count of CPUs allocated per GPU.</DD>
<DT><I>--gpu-bind</I></DT>
<DD>Define how tasks are bound to GPUs.</DD>
<DT><I>--gpu-freq</I></DT>
<DD>Specify GPU frequency and/or GPU memory frequency.</DD>
<DT><I>--mem-per-gpu</I></DT>
<DD>Memory allocated per GPU.</DD>
</DL>
<P>Jobs will be allocated specific generic resources as needed to satisfy
the request. If the job is suspended, those resources do not become available
for use by other jobs.</P>
<P>Job steps can be allocated generic resources from those allocated to the
job using the <I>--gres</I> option with the <I>srun</I> command as described
above. By default, a job step will be allocated all of the generic resources
that have been requested by the job, except those implicitly requested when a
job is exclusive. If desired, the job step may explicitly specify a
different generic resource count than the job.
This design choice was based upon a scenario where each job executes many
job steps. If job steps were granted access to all generic resources by
default, some job steps would need to explicitly specify zero generic resource
counts, which we considered more confusing. The job step can be allocated
specific generic resources and those resources will not be available to other
job steps. A simple example is shown below.</P>
<PRE>
#!/bin/bash
#
# gres_test.bash
# Submit as follows:
# sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash
#
srun --gres=gpu:2 -n2 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
wait
</PRE>
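<p>The <i>show_device.sh</i> script called above is not shown in this
document. A minimal sketch, assuming it only needs to report which devices
each job step was given, could read the standard
<code class="commandline">SLURM_JOB_ID</code> and
<code class="commandline">SLURM_STEP_ID</code> variables together with
<code class="commandline">CUDA_VISIBLE_DEVICES</code>. The contents below are
hypothetical.</p>

```shell
#!/bin/sh
# show_device.sh (hypothetical contents; the original script is not shown)
# Prints one line per task with the step ID and the GPUs visible to it.
show_device() {
    echo "JobStep=${SLURM_JOB_ID:-?}.${SLURM_STEP_ID:-?} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
}

show_device
```

<p>Every task runs the script, so a step started with <i>-n2</i> would print
its line twice.</p>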
<h2 id="AutoDetect">AutoDetect
<a class="slurm_link" href="#AutoDetect"></a>
</h2>
<p>If <i>AutoDetect=nvml</i>, <i>AutoDetect=nvidia</i>, <i>AutoDetect=rsmi</i>,
<i>AutoDetect=nrt</i>, or <i>AutoDetect=oneapi</i> is set in <i>gres.conf</i>,
configuration details will automatically be filled in for any system-detected
GPU. This removes the need to explicitly configure GPUs in gres.conf, though the
<i>Gres=</i> line in slurm.conf is still required in order to tell slurmctld how
many GRES to expect.</p>
<p>Note that <i>AutoDetect=nvml</i>, <i>AutoDetect=rsmi</i>,
and <i>AutoDetect=oneapi</i> need their corresponding GPU management libraries
installed on the node and found during Slurm configuration in order to work.
Both <i>AutoDetect=nvml</i> and <i>AutoDetect=nvidia</i> detect NVIDIA GPUs.
<i>AutoDetect=nvidia</i> (added in Slurm 24.11) doesn't require the
NVML library to be installed, but doesn't detect MIGs or NVLinks.</p>
<p>When AutoDetect needs to load a GPU management library as stated above it
will usually unload this library immediately after the slurmd is started
allowing other processes to access the library afterwards. This is not the case
when <i>AcctGatherEnergyType=acct_gather_energy/gpu</i> is set in the
slurm.conf. In this configuration the slurmd will keep the GPU library specified
by the AutoDetect option loaded to track GPU energy usage.</p>
<p><i>AutoDetect=nvml</i> and <i>AutoDetect=rsmi</i> also cause the slurmstepd
to load and hold open the library to perform accounting on GPU usage when the
associated step has requested a GPU. To prevent this, set
<i>JobAcctGatherParams=DisableGPUAcct</i> in the slurm.conf.</p>
<p><i>AutoDetect=nvml</i>, <i>AutoDetect=rsmi</i> and <i>AutoDetect=oneapi</i>
will cause the slurmstepd to load and hold open the library to configure the GPU
frequency when <i>--gpu-freq</i> is requested with the job.</p>
<p>In the above situations, while a process holds a GPU management library
open, no other process can configure the GPU. If you need to change GPU
settings in a Prolog or similar script, do not set
<i>AcctGatherEnergyType=acct_gather_energy/gpu</i> in the slurm.conf. If
jobs share GPUs on the node, you may also need to set
<i>JobAcctGatherParams=DisableGPUAcct</i> in the slurm.conf.</p>
<P>By default, all system-detected devices are added to the node.
However, if <i>Type</i> and <i>File</i> in gres.conf match a GPU on
the system, any other properties explicitly specified (e.g.
<i>Cores</i> or <i>Links</i>) can be double-checked against it.
If the system-detected GPU differs from its matching GPU configuration, then the
GPU is omitted from the node with an error.
This allows <i>gres.conf</i> to serve as an optional sanity check and notifies
administrators of any unexpected changes in GPU properties.
</P>
<p>If not all system-detected devices are specified by the slurm.conf
configuration, then the relevant slurmd will be drained. However, it is still
possible to use a subset of the devices found on the system if they are
specified manually (with AutoDetect disabled) in gres.conf.
</p>
<P>Example <I>gres.conf</I> file:</P>
<PRE>
# Configure four GPUs (with MPS), plus bandwidth
AutoDetect=nvml
Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3
Name=mps Count=200 File=/dev/nvidia0
Name=mps Count=200 File=/dev/nvidia1
Name=mps Count=100 File=/dev/nvidia2
Name=mps Count=100 File=/dev/nvidia3
Name=bandwidth Type=lustre Count=4G Flags=CountOnly
</PRE>
<p> In this example, since <i>AutoDetect=nvml</i> is specified, <i>Cores</i>
for each GPU will be checked against a corresponding GPU found on the system
matching the <i>Type</i> and <i>File</i> specified.
Since <i>Links</i> is not specified, it will be automatically filled in
according to what is found on the system.
If a matching system GPU is not found, no validation takes place and the GPU is
assumed to be as the configuration says.
</p>
<P>For <i>Type</i> to match a system-detected device, it must either exactly
match or be a substring of the GPU name reported by slurmd via the AutoDetect
mechanism. This GPU name will have all spaces replaced with underscores. To see
the detected GPUs and their names, run: <code class="commandline">slurmd -C
</code>
<PRE>
$ slurmd -C
NodeName=node0 ... Gres=gpu:geforce_rtx_2060:1 ...
Found gpu:geforce_rtx_2060:1 with Autodetect=nvml (Substring of gpu name may be used instead)
UpTime=...
</PRE>
<P>In this example, the GPU's name is reported as
<code class="commandline">geforce_rtx_2060</code>. So in your slurm.conf and
gres.conf, the GPU <i>Type</i> can be set to <code class="commandline">
geforce</code>, <code class="commandline">rtx</code>, <code class="commandline">
2060</code>, <code class="commandline">geforce_rtx_2060</code>, or any other
substring, and <b>slurmd</b> should be able to match it to the system-detected
device <code class="commandline">geforce_rtx_2060</code>.
To check your configuration you may run: <code class="commandline">slurmd -G
</code>. This will test and print the GRES setup based on the current
configuration, including any autodetected GRES that are being ignored.</P>
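<p>The substring rule for <i>Type</i> can be sketched as follows. The helper
names are hypothetical and the raw name "GeForce RTX 2060" is an assumed
example of what a GPU library might report. Replacing spaces with underscores
follows the description above, while lowercasing is an assumption inferred
from the sample <code class="commandline">slurmd -C</code> output.</p>

```shell
#!/bin/sh
# Illustrative only; hypothetical helper names, not Slurm's matching code.

# normalize_gpu_name NAME: spaces become underscores, letters are lowercased
# (the lowercasing is an assumption based on the sample output).
normalize_gpu_name() {
    printf '%s\n' "$1" | tr 'A-Z ' 'a-z_'
}

# type_matches DETECTED_NAME CONFIGURED_TYPE: succeed if the configured Type
# equals, or is a substring of, the normalized detected name.
type_matches() {
    name=$(normalize_gpu_name "$1")
    case $name in
        *"$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

for t in geforce rtx 2060 geforce_rtx_2060 tesla; do
    if type_matches "GeForce RTX 2060" "$t"; then
        echo "Type=$t matches"
    else
        echo "Type=$t does not match"
    fi
done
```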
<h2 id="Accounting">Accounting
<a class="slurm_link" href="#Accounting"></a>
</h2>
<p>GPU memory and GPU utilization can be tracked as
<a href="https://slurm.schedmd.com/tres.html">TRES</a> for tasks using GPU
resources. If <code>AccountingStorageTRES=gres/gpu</code> is configured,
gres/gpumem and gres/gpuutil will automatically be configured and gathered from
GPU jobs. gres/gpumem and gres/gpuutil can also be set individually when
gres/gpu is not set.</p>
<p>gres/gpumem and gres/gpuutil are only available for NVIDIA GPUs when using
<code>AutoDetect=nvml</code>, and AMD GPUs when using
<code>AutoDetect=rsmi</code>.</p>
<p>NVML does not support utilization metrics for MIGs, so Slurm does not
provide gpumem or gpuutil accounting for MIG devices. See
<a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#gpu-utilization-metrics">
NVIDIA's MIG User Guide</a>.</p>
<p>Here is an example node with two NVIDIA A100 GPUs:</p>
<pre>
$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-...)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-...)
</pre>
<p>The following is an excerpt from an example slurm.conf and gres.conf which
will automatically enable tracking for gres/gpumem and gres/gpuutil.</p>
<p>slurm.conf:</p>
<pre>
AccountingStorageTres=gres/gpu
NodeName=n1 Gres=gpu:a100:2
</pre>
<p>gres.conf:</p>
<pre>
AutoDetect=nvml
</pre>
<p>Here's an example of a job with two tasks that uses one GPU per task, two
GPUs total.</p>
<pre>
$ srun --tres-per-task=gres/gpu:1 -n2 --gpus=2 --mem=2G gpu_burn
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-...)
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-...)
</pre>
<p>After the job has finished, we can see utilization details of these TRES
using sacct. Note that there are multiple tasks here and we can use
TRESUsageIn[Min,Max,Ave,Tot] to examine the "highwater marks" for gpumem and
gpuutil over all the ranks in the step, as is true for other TRESUsage*
values:</p>
<pre>
$ sacct -j 1277.0 --format=tresusageinave -p
TRESUsageInAve|
cpu=00:00:11,energy=0,fs/disk=87613,gres/gpumem=36266M,gres/gpuutil=100,mem=628748K,pages=0,vmem=0|
$ sacct -j 1277.0 --format=tresusageintot -p
TRESUsageInTot|
cpu=00:00:22,energy=0,fs/disk=175227,gres/gpumem=72532M,gres/gpuutil=200,mem=1257496K,pages=0,vmem=0|
</pre>
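<p>When post-processing this output in a script, a single TRES value can be
pulled out of the comma-separated string. The sketch below uses a hypothetical
helper, <i>tres_field</i>, and the two sample lines shown above. It also strips
the trailing "|" that the <i>-p</i> option appends.</p>

```shell
#!/bin/sh
# tres_field TRES_STRING KEY: print the value stored under KEY.
# Hypothetical helper for illustration; not part of Slurm.
tres_field() {
    printf '%s\n' "${1%|}" | tr ',' '\n' |
        awk -F= -v key="$2" '$1 == key { print $2 }'
}

ave='cpu=00:00:11,energy=0,fs/disk=87613,gres/gpumem=36266M,gres/gpuutil=100,mem=628748K,pages=0,vmem=0|'
tot='cpu=00:00:22,energy=0,fs/disk=175227,gres/gpumem=72532M,gres/gpuutil=200,mem=1257496K,pages=0,vmem=0|'

tres_field "$ave" gres/gpumem    # prints 36266M
tres_field "$tot" gres/gpumem    # prints 72532M
tres_field "$tot" gres/gpuutil   # prints 200
```

<p>In this sample the totals are twice the averages (2 x 36266M = 72532M and
2 x 100 = 200), which is consistent with a step of two tasks.</p>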
<h2 id="GPU_Management">GPU Management
<a class="slurm_link" href="#GPU_Management"></a>
</h2>
<P>In the case of Slurm's GRES plugin for GPUs, the environment variable
<code class="commandline">CUDA_VISIBLE_DEVICES</code>
is set for each job step to determine which GPUs are
available for its use on each node. This environment variable is only set
when tasks are launched on a specific compute node (no global environment
variable is set for the <i>salloc</i> command and the environment variable set
for the <i>sbatch</i> command only reflects the GPUs allocated to that job
on that node, node zero of the allocation).
CUDA version 3.1 (or higher) uses this environment
variable in order to run multiple jobs or job steps on a node with GPUs
and ensure that the resources assigned to each are unique. In the example
above, the allocated node may have four or more graphics devices. In that
case, <code class="commandline">CUDA_VISIBLE_DEVICES</code>
will reference unique devices for each job step and
the output might resemble this:</P>
<PRE>
JobStep=1234.0 CUDA_VISIBLE_DEVICES=0,1
JobStep=1234.1 CUDA_VISIBLE_DEVICES=2
JobStep=1234.2 CUDA_VISIBLE_DEVICES=3
</PRE>
<p><b>NOTE</b>: Be sure to specify the <I>File</I> parameters in the
<I>gres.conf</I> file and ensure they are in increasing numeric order.</p>
<p>The <code class="commandline">CUDA_VISIBLE_DEVICES</code>
environment variable will also be set in the job's Prolog and Epilog programs.
Note that the environment variable set for the job may differ from that set for
the Prolog and Epilog if Slurm is configured to constrain the device files
visible to a job using Linux cgroup.
This is because the Prolog and Epilog programs run <u>outside</u> of any Linux
cgroup while the job runs <u>inside</u> of the cgroup and may thus have a
different set of visible devices.
For example, if a job is allocated the device "/dev/nvidia1", then
<code class="commandline">CUDA_VISIBLE_DEVICES</code> will be set to a value of
"1" in the Prolog and Epilog while the job's value of
<code class="commandline">CUDA_VISIBLE_DEVICES</code> will be set to a
value of "0" (i.e. the first GPU device visible to the job).
For more information see the
<a href="prolog_epilog.html">Prolog and Epilog Guide</a>.</p>
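<p>The renumbering described above can be illustrated with a short sketch.
It assumes, as in the example, that a job constrained by a device cgroup sees
only its own GPUs, numbered from zero in allocation order. The function name
is hypothetical.</p>

```shell
#!/bin/sh
# job_visible_devices PROLOG_VALUE
# Given the CUDA_VISIBLE_DEVICES value seen by the Prolog (global device
# numbers), print the value the job would see inside its cgroup, assuming
# the allocated devices are simply renumbered 0..N-1.
job_visible_devices() {
    n=$(printf '%s\n' "$1" | tr ',' '\n' | grep -c . || true)
    i=0
    out=""
    while [ "$i" -lt "$n" ]; do
        out="${out:+$out,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

job_visible_devices 1      # prints 0
job_visible_devices 1,3    # prints 0,1
```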
<p>When possible, Slurm automatically determines the GPUs on the system using
NVML. NVML (which powers the
<code class="commandline">nvidia-smi</code> tool) numbers GPUs in order by their
PCI bus IDs. For this numbering to match the numbering reported by CUDA, the
<code class="commandline">CUDA_DEVICE_ORDER</code> environment variable must
be set to <code class="commandline">CUDA_DEVICE_ORDER=PCI_BUS_ID</code>.</p>
<p>GPU device files (e.g. <i>/dev/nvidia1</i>) are
based on the Linux minor number assignment, while NVML's device numbers are
assigned via PCI bus ID, from lowest to highest. Mapping between these two is
nondeterministic and system dependent, and could vary between boots after
hardware or OS changes. For the most part, this assignment seems fairly stable.
However, an after-bootup check is required to guarantee that a GPU device is
assigned to a specific device file.</p>
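<p>One possible after-bootup check is sketched below. It assumes that the
NVIDIA Linux driver exposes one directory per GPU under
<i>/proc/driver/nvidia/gpus/</i>, named by PCI bus ID, each holding an
<i>information</i> file with a "Device Minor" line. That layout and the
function names are assumptions of this sketch, not something Slurm provides.
The check lists the GPUs in PCI bus order and reports any whose minor number
differs from its position in that order.</p>

```shell
#!/bin/sh
# list_gpu_minors [ROOT]: print "<pci bus id> <device minor>" per GPU,
# sorted by PCI bus ID. ROOT defaults to the assumed driver proc directory.
list_gpu_minors() {
    root=${1:-/proc/driver/nvidia/gpus}
    for info in "$root"/*/information; do
        [ -r "$info" ] || continue
        bus=$(basename "$(dirname "$info")")
        minor=$(awk -F: '/^Device Minor/ { gsub(/[ \t]/, "", $2); print $2 }' "$info")
        printf '%s %s\n' "$bus" "$minor"
    done | sort
}

# check_minor_order [ROOT]: fail and report if the Nth GPU in PCI bus order
# is not /dev/nvidiaN.
check_minor_order() {
    list_gpu_minors "$1" | awk '
        $2 != NR - 1 {
            printf "PCI order %d (%s) is /dev/nvidia%s\n", NR - 1, $1, $2
            bad = 1
        }
        END { exit bad }'
}

check_minor_order || echo "device files do not follow PCI bus order" >&2
```

<p>On a node where the check fails, the <i>File</i> entries in
<i>gres.conf</i> would need to be reviewed against the reported mapping.</p>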
<p>Please consult the
<a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars">
NVIDIA CUDA documentation</a> for more information about the
<code class="commandline">CUDA_VISIBLE_DEVICES</code> and
<code class="commandline">CUDA_DEVICE_ORDER</code> environment variables.</p>
<h2 id="MPS_Management">MPS Management
<a class="slurm_link" href="#MPS_Management"></a>
</h2>
<p><a href="https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf">
CUDA Multi-Process Service (MPS)</a> provides a mechanism where GPUs can be
shared by multiple jobs, where each job is allocated some percentage of the
GPU's resources.
The total count of MPS resources available on a node should be configured in
the <I>slurm.conf</I> file (e.g. "NodeName=tux[1-16] Gres=gpu:2,mps:200").
Several options are available for configuring MPS in the <I>gres.conf</I> file
as listed below with examples following that:</p>
<ol>
<li>No MPS configuration: The count of gres/mps elements defined in the
<I>slurm.conf</I> will be evenly distributed across all GPUs configured on the
node. For example, "NodeName=tux[1-16] Gres=gpu:2,mps:200" will configure
a count of 100 gres/mps resources on each of the two GPUs.</li>
<li>MPS configuration includes only the <I>Name</I> and <I>Count</I> parameters:
The count of gres/mps elements will be evenly distributed across all GPUs
configured on the node. This is similar to case 1, but places duplicate
configuration in the gres.conf file.</li>
<li>MPS configuration includes the <I>Name</I>, <I>File</I> and <I>Count</I>
parameters: Each <I>File</I> parameter should identify the device file path of a
GPU and the <I>Count</I> should identify the number of gres/mps resources
available for that specific GPU device.
This may be useful in a heterogeneous environment.
For example, some GPUs on a node may be more powerful than others and thus be
associated with a higher gres/mps count.
Another use case would be to prevent some GPUs from being used for MPS (i.e.
they would have an MPS count of zero).</li>
</ol>
<p>Note that <I>Type</I> and <I>Cores</I> parameters for gres/mps are ignored.
That information is copied from the gres/gpu configuration.</p>
<p>Note the <I>Count</I> parameter is translated to a percentage, so the value
would typically be a multiple of 100.</p>
<p>Note that if NVIDIA's NVML library is installed, the GPU configuration
(i.e. <I>Type</I>, <I>File</I>, <I>Cores</I> and <I>Links</I> data) will be
automatically gathered from the library and need not be recorded in the
<I>gres.conf</I> file.</p>
<p>By default, job requests for MPS are required to fit on a single gpu on
each node. This can be overridden with a flag in the <I>slurm.conf</I>
configuration file. See <a href="slurm.conf.html#OPT_MULTIPLE_SHARING_GRES_PJ">
OPT_MULTIPLE_SHARING_GRES_PJ</a>.</p>
<p>Note the same GPU can be allocated either as a GPU type of GRES or as
an MPS type of GRES, but not both.
In other words, once a GPU has been allocated as a gres/gpu resource it will
not be available as a gres/mps.
Likewise, once a GPU has been allocated as a gres/mps resource it will
not be available as a gres/gpu.
However the same GPU can be allocated as MPS generic resources to multiple jobs
belonging to multiple users, so long as the total count of MPS allocated to
jobs does not exceed the configured count.
Also, since shared GRES (MPS) cannot be allocated at the same time as a sharing
GRES (GPU) this option only allocates all sharing GRES and no underlying shared
GRES.
Some example configurations for Slurm's gres.conf file are shown below.</p>
<PRE>
# Example 1 of gres.conf
# Configure four GPUs (with MPS)
AutoDetect=nvml
Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3
# Set gres/mps Count value to 100 on each of the 4 available GPUs
Name=mps Count=400
</PRE>
<a id="MPS_config_example_2"></a>
<PRE>
# Example 2 of gres.conf
# Configure four different GPU types (with MPS)
AutoDetect=nvml
Name=gpu Type=gtx1080 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gtx1070 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=gtx1060 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=gtx1050 File=/dev/nvidia3 Cores=2,3
Name=mps Count=1300 File=/dev/nvidia0
Name=mps Count=1200 File=/dev/nvidia1
Name=mps Count=1100 File=/dev/nvidia2
Name=mps Count=1000 File=/dev/nvidia3
</PRE>
<p><b>NOTE</b>: <i>gres/mps</i> requires the use of the <i>select/cons_tres</i>
plugin.</p>
<p>Job requests for MPS will be processed the same as any other GRES except
that the request must be satisfied using only one GPU per node and only one
GPU per node may be configured for use with MPS.
For example, a job request for "--gres=mps:50" will not be satisfied by using
20 percent of one GPU and 30 percent of a second GPU on a single node.
Multiple jobs from different users can use MPS on a node at the same time.
Note that GRES types of GPU <u>and</u> MPS can not be requested within
a single job.
Also jobs requesting MPS resources can not specify a GPU frequency.</p>
<p>A prolog program should be used to start and stop MPS servers as needed.
A sample prolog script to do this is included with the Slurm distribution in
the location <i>etc/prolog.example</i>.
Its mode of operation is if a job is allocated gres/mps resources then the
Prolog will have the <code class="commandline">CUDA_VISIBLE_DEVICES</code>,
<code class="commandline">CUDA_MPS_ACTIVE_THREAD_PERCENTAGE</code>, and
<code class="commandline">SLURM_JOB_UID</code> environment variables set.
The Prolog should then make sure that an MPS server is started for that GPU
and user (UID == User ID).
It also records the GPU device ID in a local file.
If a job is allocated gres/gpu resources then the Prolog will have the
<code class="commandline">CUDA_VISIBLE_DEVICES</code> and
<code class="commandline">SLURM_JOB_UID</code> environment variables set
(no <code class="commandline">CUDA_MPS_ACTIVE_THREAD_PERCENTAGE</code>).
The Prolog should then terminate any MPS server associated with that GPU.
It may be necessary to modify this script as needed for the local environment.
For more information see the
<a href="prolog_epilog.html">Prolog and Epilog Guide</a>.</p>
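<p>The decision logic described above can be sketched as a dry run. This is
not the <i>etc/prolog.example</i> script shipped with Slurm. It is a minimal
illustration with a hypothetical function name that only prints what a Prolog
would do, based on the environment variables listed above. The actual commands
for starting and stopping an MPS server are deliberately left out.</p>

```shell
#!/bin/sh
# mps_prolog_action: print the action a Prolog would take for this job.
# Hypothetical helper; decisions follow the variables described in the text.
mps_prolog_action() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo none        # no GPU or MPS allocation on this node
    elif [ -n "${CUDA_MPS_ACTIVE_THREAD_PERCENTAGE:-}" ]; then
        echo start-mps   # gres/mps: ensure a server runs for this GPU and user
    else
        echo stop-mps    # gres/gpu: terminate any MPS server on this GPU
    fi
}

case $(mps_prolog_action) in
    start-mps)
        echo "would start MPS on GPU $CUDA_VISIBLE_DEVICES for UID ${SLURM_JOB_UID:-?}"
        echo "would record GPU $CUDA_VISIBLE_DEVICES in a local file"
        ;;
    stop-mps)
        echo "would stop any MPS server on GPU $CUDA_VISIBLE_DEVICES"
        ;;
esac
```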
<p>Jobs requesting MPS resources will have the
<code class="commandline">CUDA_VISIBLE_DEVICES</code>
and <code class="commandline">CUDA_DEVICE_ORDER</code> environment variables set.
The device ID is relative to those resources under MPS server control and will
always have a value of zero in the current implementation (only one GPU will be
usable in MPS mode per node).
The job will also have the
<code class="commandline">CUDA_MPS_ACTIVE_THREAD_PERCENTAGE</code>
environment variable set to that job's percentage of MPS resources available on
the assigned GPU.
The percentage will be calculated based upon the portion of the configured
Count on the Gres that is allocated to a job or step.
For example, a job requesting "--gres=mps:200" and using
<a href="#MPS_config_example_2">configuration example 2</a> above would be
allocated<br>
15% of the gtx1080 (File=/dev/nvidia0, 200 x 100 / 1300 = 15), or<br>
16% of the gtx1070 (File=/dev/nvidia1, 200 x 100 / 1200 = 16), or<br>
18% of the gtx1060 (File=/dev/nvidia2, 200 x 100 / 1100 = 18), or<br>
20% of the gtx1050 (File=/dev/nvidia3, 200 x 100 / 1000 = 20).</p>
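<p>The arithmetic in this example can be reproduced with integer division.
That the result is truncated rather than rounded is an assumption inferred
from the figures above (200 x 100 / 1200 is 16.67 but is listed as 16). The
function name is hypothetical.</p>

```shell
#!/bin/sh
# mps_percent REQUESTED CONFIGURED_COUNT
# Percentage of one GPU given to a job, using truncating integer division.
mps_percent() {
    echo $(( $1 * 100 / $2 ))
}

# The four GPUs of configuration example 2, with --gres=mps:200
for count in 1300 1200 1100 1000; do
    echo "Count=$count -> $(mps_percent 200 "$count")%"
done
```

<p>This prints 15%, 16%, 18% and 20% for the four configured counts.</p>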
<p>An alternate mode of operation would be to permit jobs to be allocated whole
GPUs then trigger the starting of an MPS server based upon comments in the job.
For example, if a job is allocated whole GPUs then search for a comment of
"mps-per-gpu" or "mps-per-node" in the job (using the "scontrol show job"
command) and use that as a basis for starting one MPS daemon per GPU or across
all GPUs respectively.</p>
<p>Please consult the
<a href="https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf">
NVIDIA Multi-Process Service documentation</a> for more information about MPS.</p>
<p>
Note that a vulnerability exists in previous versions of the NVIDIA driver that
may affect users when sharing GPUs. More information can be found in
<a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-6260">
CVE-2018-6260</a> and in the
<a href="https://nvidia.custhelp.com/app/answers/detail/a_id/4772">
Security Bulletin: NVIDIA GPU Display Driver - February 2019</a>.</p>
<p>NVIDIA MPS has a built-in limitation regarding GPU sharing among different
users. Only one user on a system may have an active MPS server, and the MPS
control daemon will queue MPS server activation requests from separate users,
leading to serialized exclusive access of the GPU between users (see
<a href="https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_1_1">
Section 2.3.1.1 - Limitations</a> in the MPS docs). So different users cannot
truly run concurrently on the GPU with MPS; rather, the GPU will be time-sliced
between the users (for a diagram depicting this process, see
<a href="https://docs.nvidia.com/deploy/mps/index.html#topic_4_3">
Section 3.3 - Provisioning Sequence</a> in the MPS docs).</p>
<h2 id="MIG_Management">MIG Management
<a class="slurm_link" href="#MIG_Management"></a>
</h2>
<p>Beginning in version 21.08, Slurm supports NVIDIA
<i>Multi-Instance GPU</i> (MIG) devices. This feature allows some newer NVIDIA
GPUs (like the A100) to split up a GPU into up to seven separate, isolated GPU
instances. Slurm can treat these MIG instances as individual GPUs, complete with
cgroup isolation and task binding.</p>
<p>To configure MIGs in Slurm, specify
<code class="commandline">AutoDetect=nvml</code> in <i>gres.conf</i> for the
nodes with MIGs, and specify <code class="commandline">Gres</code>
in <i>slurm.conf</i> as if the MIGs were regular GPUs, like this:
<code class="commandline">NodeName=tux[1-16] gres=gpu:2</code>. An optional
GRES type can be specified to distinguish MIGs of different sizes from each
other, as well as from other GPUs in the cluster. This type must be a substring
of the "MIG Profile" string as reported by the node in its slurmd log under the
<code class="commandline">debug2</code> log level. Here is an example slurm.conf
for a system with 2 gpus, one of which is partitioned into 2 MIGs where the
"MIG Profile" is <code class="commandline">nvidia_a100_3g.20gb</code>:</p>
<pre>
AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:a100_3g.20gb
GresTypes=gpu
NodeName=tux[1-16] gres=gpu:a100:1,gpu:a100_3g.20gb:2
</pre>
<p>The <a href="gres.conf.html#OPT_MultipleFiles">MultipleFiles</a> parameter
allows you to specify multiple device files for the GPU card.</p>
<p>The sanity-check AutoDetect mode is not supported for MIGs.
Slurm expects MIG devices to already be partitioned, and does not support
dynamic MIG partitioning.</p>
<p>For more information on NVIDIA MIGs (including how to partition them), see
<a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html">
the MIG user guide</a>.</p>
<h2 id="Sharding">Sharding
<a class="slurm_link" href="#Sharding"></a>
</h2>
<p>
Sharding provides a generic mechanism where GPUs can be
shared by multiple jobs. While it does permit multiple jobs to run on a given
GPU it does not fence the processes running on the GPU, it only allows the GPU
to be shared. Sharding, therefore, works best with homogeneous workflows. It is
recommended to limit the number of shards on a node to equal the max possible
jobs that can run simultaneously on the node (i.e. cores).
The total count of shards available on a node should be configured in
the <I>slurm.conf</I> file (e.g. "NodeName=tux[1-16] Gres=gpu:2,shard:64").
Several options are available for configuring shards in the <I>gres.conf</I> file
as listed below with examples following that:</p>
<ol>
<li>No Shard configuration: The count of gres/shard elements defined in the
<I>slurm.conf</I> will be evenly distributed across all GPUs configured on the
node. For example, "NodeName=tux[1-16] Gres=gpu:2,shard:64" will configure
a count of 32 gres/shard resources on each of the two GPUs.</li>
<li>Shard configuration includes only the <I>Name</I> and <I>Count</I> parameters:
The count of gres/shard elements will be evenly distributed across all GPUs
configured on the node. This is similar to case 1, but places duplicate
configuration in the gres.conf file.</li>
<li>Shard configuration includes the <I>Name</I>, <I>File</I> and <I>Count</I>
parameters: Each <I>File</I> parameter should identify the device file path of a
GPU and the <I>Count</I> should identify the number of gres/shard resources
available for that specific GPU device.
This may be useful in a heterogeneous environment.
For example, some GPUs on a node may be more powerful than others and thus be
associated with a higher gres/shard count.
Another use case would be to prevent some GPUs from being used for sharding (i.e.
they would have a shard count of zero).</li>
</ol>
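<p>The sizing guidance and the even distribution from cases 1 and 2 can be
sketched with two small helpers. Both names are hypothetical. The sketch
assumes the shard total divides evenly by the GPU count, since how a remainder
would be distributed is not described here.</p>

```shell
#!/bin/sh
# shard_gres_line NGPUS NCORES
# Build a Gres= value with one shard per core, following the recommendation
# to match the shard count to the number of jobs that can run at once.
shard_gres_line() {
    echo "Gres=gpu:$1,shard:$2"
}

# shards_per_gpu TOTAL_SHARDS NGPUS
# Shards assigned to each GPU when no File= is given (even split assumed).
shards_per_gpu() {
    echo $(( $1 / $2 ))
}

shard_gres_line 2 64    # prints Gres=gpu:2,shard:64
shards_per_gpu 64 2     # prints 32
```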
<p>Note that <I>Type</I> and <I>Cores</I> parameters for gres/shard are ignored.
That information is copied from the gres/gpu configuration.</p>
<p>Note that if NVIDIA's NVML library is installed, the GPU configuration
(i.e. <I>Type</I>, <I>File</I>, <I>Cores</I> and <I>Links</I> data) will be
automatically gathered from the library and need not be recorded in the
<I>gres.conf</I> file.</p>
<p>Note the same GPU can be allocated either as a GPU type of GRES or as
a shard type of GRES, but not both.
In other words, once a GPU has been allocated as a gres/gpu resource it will
not be available as a gres/shard.
Likewise, once a GPU has been allocated as a gres/shard resource it will
not be available as a gres/gpu.
However the same GPU can be allocated as shard generic resources to multiple jobs
belonging to multiple users, so long as the total count of shards allocated to
jobs does not exceed the configured count.</p>
<p>By default, job requests for shards are required to fit on a single gpu on
each node. This can be overridden with a flag in the <I>slurm.conf</I>
configuration file. See <a href="slurm.conf.html#OPT_MULTIPLE_SHARING_GRES_PJ">
OPT_MULTIPLE_SHARING_GRES_PJ</a>.</p>
<p>For sharding to be configured correctly, <i>shard</i> must be added as a
GRES on the relevant nodes and to the <i>GresTypes</i> parameter. If you want
the shards to be tracked in accounting then <i>shard</i> also needs to be
added to <i>AccountingStorageTRES</i>.
See the relevant settings in an example slurm.conf:</p>
<pre>
AccountingStorageTRES=gres/gpu,gres/shard
GresTypes=gpu,shard
NodeName=tux[1-16] Gres=gpu:2,shard:64
</pre>
<p>Some example configurations for Slurm's gres.conf file are shown below.</p>
<PRE>
# Example 1 of gres.conf
# Configure four GPUs (with Sharding)
AutoDetect=nvml
Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3
# Set gres/shard Count value to 8 on each of the 4 available GPUs
Name=shard Count=32
</PRE>
<a id="Shard_config_example_2"></a>
<PRE>
# Example 2 of gres.conf
# Configure four different GPU types (with Sharding)
AutoDetect=nvml
Name=gpu Type=gtx1080 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gtx1070 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=gtx1060 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=gtx1050 File=/dev/nvidia3 Cores=2,3
Name=shard Count=8 File=/dev/nvidia0
Name=shard Count=8 File=/dev/nvidia1
Name=shard Count=8 File=/dev/nvidia2
Name=shard Count=8 File=/dev/nvidia3
</PRE>
<p><b>NOTE</b>: <i>gres/shard</i> requires the use of the <i>select/cons_tres</i>
plugin.</p>
<p>Job requests for shards can not specify a GPU frequency.</p>
<p>Jobs requesting shard resources will have the
<code class="commandline">CUDA_VISIBLE_DEVICES</code>, <code class="commandline">ROCR_VISIBLE_DEVICES</code>,
or <code class="commandline">GPU_DEVICE_ORDINAL</code> environment variable set
which would be the same as if it were a GPU.
</p>
<p>Steps with shards have <code class="commandline">SLURM_SHARDS_ON_NODE</code>
set indicating the number of shards allocated.</p>
<p style="text-align: center;">Last modified 10 April 2025</p>
<!--#include virtual="footer.txt"-->