| <!--#include virtual="header.txt"--> |
| |
| <h1>Generic Resource (GRES) Scheduling</h1> |
| |
| <h2 id="contents">Contents<a class="slurm_link" href="#contents"></a></h2> |
| <ul> |
| <li><a href="#Overview">Overview</a></li> |
| <li><a href="#Configuration">Configuration</a></li> |
| <li><a href="#Running_Jobs">Running Jobs</a></li> |
| <li><a href="#AutoDetect">AutoDetect</a></li> |
| <li><a href="#Accounting">Accounting</a></li> |
| <li><a href="#GPU_Management">GPU Management</a></li> |
| <li><a href="#MPS_Management">MPS Management</a></li> |
| <li><a href="#MIG_Management">MIG Management</a></li> |
| <li><a href="#Sharding">Sharding</a></li> |
| </ul> |
| |
| <h2 id="Overview">Overview<a class="slurm_link" href="#Overview"></a></h2> |
| <p>Slurm supports the ability to define and schedule arbitrary Generic RESources |
| (GRES). Additional built-in features are enabled for specific GRES types, |
| including Graphics Processing Units (GPUs), CUDA Multi-Process Service (MPS) |
| devices, and Sharding through an extensible plugin mechanism.</p> |
| |
| <h2 id="Configuration">Configuration |
| <a class="slurm_link" href="#Configuration"></a> |
| </h2> |
| |
| <P>By default, no GRES are enabled in the cluster's configuration. |
| You must explicitly specify which GRES are to be managed in the |
| <I>slurm.conf</I> configuration file. The configuration parameters of |
| interest are <B>GresTypes</B> and <B>Gres</B>. |
| </P> |
| |
| <P> |
| For more details, see <a href="slurm.conf.html#OPT_GresTypes">GresTypes</a> and <a href="slurm.conf.html#OPT_Gres_1">Gres</a> in the <I>slurm.conf</I> man page. |
| </P> |
| |
<P>Note that the GRES specification for each node works in the same fashion
as any other resource managed by Slurm. Nodes which are found to have fewer
resources than configured will be placed in a DRAIN state.</P>
| |
| <P>Snippet from an example <I>slurm.conf</I> file:</P> |
| <PRE> |
| # Configure four GPUs (with MPS), plus bandwidth |
| GresTypes=gpu,mps,bandwidth |
| NodeName=tux[0-7] Gres=gpu:tesla:2,gpu:kepler:2,mps:400,bandwidth:lustre:no_consume:4G |
| </PRE> |
| |
<P>Each compute node with generic resources typically contains a <I>gres.conf</I>
file describing which resources are available on the node, their count,
associated device files, and the cores which should be used with those resources.</P>
| |
<p>There are cases where you may want to define a Generic Resource on a node
without specifying a quantity of that GRES. For example, a GRES representing
a node's filesystem type is not depleted as jobs run on that node.
You can use the <b>no_consume</b> flag to allow users to request a GRES
without its defined count being consumed as it is requested.</p>
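<p>As a minimal sketch, such a GRES is requested like any other; the count is
validated against the configured total but never deducted from it. The GRES name
below matches the bandwidth example above, and the script name is illustrative:</p>
<pre>
# Request 1G of the no_consume "bandwidth" GRES defined in the slurm.conf
# snippet above; the node's 4G total is not reduced while the job runs.
sbatch --gres=bandwidth:lustre:1G -N1 job.sh
</pre>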
| |
| <P> |
| To view available <I>gres.conf</I> configuration parameters, see the |
| <a href="gres.conf.html">gres.conf man page</a>.</P> |
| |
| <h2 id="Running_Jobs">Running Jobs |
| <a class="slurm_link" href="#Running_Jobs"></a> |
| </h2> |
| |
| <P>Jobs will not be allocated any generic resources unless specifically |
| requested at job submit time using the options:</P> |
| <DL> |
| <DT><I>--gres</I></DT> |
| <DD>Generic resources required per node</DD> |
| <DT><I>--gpus</I></DT> |
| <DD>GPUs required per job</DD> |
| <DT><I>--gpus-per-node</I></DT> |
| <DD>GPUs required per node. Equivalent to the <I>--gres</I> option for GPUs.</DD> |
| <DT><I>--gpus-per-socket</I></DT> |
| <DD>GPUs required per socket. Requires the job to specify a task socket.</DD> |
| <DT><I>--gpus-per-task</I></DT> |
| <DD>GPUs required per task. Requires the job to specify a task count.</DD> |
| </DL> |
| |
| <P>All of these options are supported by the <I>salloc</I>, <I>sbatch</I> and |
| <I>srun</I> commands. |
| Note that all of the <I>--gpu*</I> options are only supported by Slurm's |
| select/cons_tres plugin. |
| Jobs requesting these options when the select/cons_tres plugin is <U>not</U> |
| configured will be rejected. |
The <I>--gres</I> option requires an argument specifying which generic resources
are required and how many, using the form <I>name[:type:count]</I>,
while all of the <I>--gpu*</I> options require an argument of the form
<I>[type]:count</I>.
| The <I>name</I> is the same name as |
| specified by the <I>GresTypes</I> and <I>Gres</I> configuration parameters. |
| <I>type</I> identifies a specific type of that generic resource (e.g. a |
| specific model of GPU). |
| <I>count</I> specifies how many resources are required and has a default |
| value of 1. For example:<BR> |
| <I>sbatch --gres=gpu:kepler:2 ...</I>.</P> |
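<p>To illustrate the two argument forms (the GPU type name is taken from the
configuration example above and the script name is hypothetical):</p>
<pre>
# --gres uses name[:type:count] and is a per-node request:
sbatch --gres=gpu:kepler:2 -N1 job.sh
# The --gpu* options use [type]:count; for GPUs this request is equivalent:
sbatch --gpus-per-node=kepler:2 -N1 job.sh
</pre>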
| |
| <p>Requests for typed vs non-typed generic resources must be consistent |
| within a job. For example, if you request <i>--gres=gpu:2</i> with |
| <b>sbatch</b>, you would not be able to request <i>--gres=gpu:tesla:2</i> |
| with <b>srun</b> to create a job step. The same holds true in reverse, |
| if you request a typed GPU to create a job allocation, you should request |
| a GPU of the same type to create a job step.</p> |
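<p>For example, a consistent pair of requests might look like the following
(the <i>tesla</i> type and the script and program names are illustrative):</p>
<pre>
# Typed request for the job allocation:
sbatch --gres=gpu:tesla:2 job.sh
# Inside job.sh, the step request also names the type:
srun --gres=gpu:tesla:1 step_program
</pre>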
| |
<P>Several additional resource requirement specifications are available
specifically for GPUs; detailed descriptions of these options are
available in the man pages for the job submission commands.
As with the <I>--gpu*</I> options above, these options are only supported by
Slurm's select/cons_tres plugin. An illustrative combination is shown after
the list below.</P>
| <DL> |
| <DT><I>--cpus-per-gpu</I></DT> |
| <DD>Count of CPUs allocated per GPU.</DD> |
| <DT><I>--gpu-bind</I></DT> |
| <DD>Define how tasks are bound to GPUs.</DD> |
| <DT><I>--gpu-freq</I></DT> |
| <DD>Specify GPU frequency and/or GPU memory frequency.</DD> |
| <DT><I>--mem-per-gpu</I></DT> |
| <DD>Memory allocated per GPU.</DD> |
| </DL> |
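<p>A hedged example combining several of these options; the script name and the
values chosen are illustrative, not recommendations:</p>
<pre>
# Two GPUs per node, four CPUs and 16 GB of memory per GPU,
# tasks bound to the closest GPU, GPUs run at a high frequency:
sbatch --gpus-per-node=2 --cpus-per-gpu=4 --mem-per-gpu=16G \
       --gpu-bind=closest --gpu-freq=high -N1 train.sh
</pre>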
| |
| <P>Jobs will be allocated specific generic resources as needed to satisfy |
| the request. If the job is suspended, those resources do not become available |
| for use by other jobs.</P> |
| |
| <P>Job steps can be allocated generic resources from those allocated to the |
| job using the <I>--gres</I> option with the <I>srun</I> command as described |
| above. By default, a job step will be allocated all of the generic resources |
| that have been requested by the job, except those implicitly requested when a |
| job is exclusive. If desired, the job step may explicitly specify a |
| different generic resource count than the job. |
| This design choice was based upon a scenario where each job executes many |
| job steps. If job steps were granted access to all generic resources by |
| default, some job steps would need to explicitly specify zero generic resource |
| counts, which we considered more confusing. The job step can be allocated |
| specific generic resources and those resources will not be available to other |
| job steps. A simple example is shown below.</P> |
| |
| <PRE> |
| #!/bin/bash |
| # |
| # gres_test.bash |
| # Submit as follows: |
| # sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash |
| # |
| srun --gres=gpu:2 -n2 --exclusive show_device.sh & |
| srun --gres=gpu:1 -n1 --exclusive show_device.sh & |
| srun --gres=gpu:1 -n1 --exclusive show_device.sh & |
| wait |
| </PRE> |
| |
| <h2 id="AutoDetect">AutoDetect |
| <a class="slurm_link" href="#AutoDetect"></a> |
| </h2> |
| |
| <p>If <i>AutoDetect=nvml</i>, <i>AutoDetect=nvidia</i>, <i>AutoDetect=rsmi</i>, |
| <i>AutoDetect=nrt</i>, or <i>AutoDetect=oneapi</i> are set in <i>gres.conf</i>, |
| configuration details will automatically be filled in for any system-detected |
| GPU. This removes the need to explicitly configure GPUs in gres.conf, though the |
| <i>Gres=</i> line in slurm.conf is still required in order to tell slurmctld how |
| many GRES to expect.</p> |
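<p>A minimal <i>gres.conf</i> sketch, assuming NVML is available on most nodes
and that one set of nodes should instead be configured by hand. The node names,
type, and device paths are illustrative; per-node <i>AutoDetect</i> overrides are
described in the gres.conf man page:</p>
<pre>
# Use NVML autodetection cluster-wide...
AutoDetect=nvml
# ...but configure tux[5-6] manually, disabling AutoDetect for those nodes.
NodeName=tux[5-6] AutoDetect=off Name=gpu Type=tesla File=/dev/nvidia[0-3]
</pre>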
| |
| <p>Note that <i>AutoDetect=nvml</i>, <i>AutoDetect=rsmi</i>, |
| and <i>AutoDetect=oneapi</i> need their corresponding GPU management libraries |
| installed on the node and found during Slurm configuration in order to work. |
| Both <i>AutoDetect=nvml</i> and <i>AutoDetect=nvidia</i> detect NVIDIA GPUs. |
<i>AutoDetect=nvidia</i> (added in Slurm 24.11) does not require the
NVML library to be installed, but it does not detect MIGs or NVLinks.</p>
| |
<p>When AutoDetect needs to load a GPU management library as described above, it
will usually unload that library immediately after slurmd starts, allowing other
processes to access the library afterwards. This is not the case
when <i>AcctGatherEnergyType=acct_gather_energy/gpu</i> is set in
slurm.conf. In that configuration, slurmd keeps the GPU library specified
by the AutoDetect option loaded in order to track GPU energy usage.</p>
| |
<p><i>AutoDetect=nvml</i> and <i>AutoDetect=rsmi</i> also cause the slurmstepd
to load and hold open the library to perform accounting on GPU usage when the
associated step has requested a GPU. If you do not want this to happen,
set <i>JobAcctGatherParams=DisableGPUAcct</i> in slurm.conf.</p>
| |
| <p><i>AutoDetect=nvml</i>, <i>AutoDetect=rsmi</i> and <i>AutoDetect=oneapi</i> |
| will cause the slurmstepd to load and hold open the library to configure the gpu |
| frequency when <i>--gpu-freq</i> is requested with the job.</p> |
| |
<p>In the situations above, when a GPU management library is held open by a
process, no other process can configure the GPU. If you need to make such
changes in a Prolog or similar script, do not set
<i>AcctGatherEnergyType=acct_gather_energy/gpu</i> in slurm.conf. If
jobs share GPUs on the node, you may also need to set
<i>JobAcctGatherParams=DisableGPUAcct</i> in slurm.conf.</p>
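<p>A hedged slurm.conf excerpt reflecting the advice above, for a site that needs
the Prolog to reconfigure GPUs and shares GPUs between jobs (the values shown are
assumptions, not recommendations):</p>
<pre>
# Leave AcctGatherEnergyType unset (or set to something other than
# acct_gather_energy/gpu) so slurmd releases the GPU library after startup.
#AcctGatherEnergyType=acct_gather_energy/gpu
# Disable per-step GPU accounting so slurmstepd does not hold the library open.
JobAcctGatherParams=DisableGPUAcct
</pre>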
| |
| <P>By default, all system-detected devices are added to the node. |
| However, if <i>Type</i> and <i>File</i> in gres.conf match a GPU on |
| the system, any other properties explicitly specified (e.g. |
| <i>Cores</i> or <i>Links</i>) can be double-checked against it. |
| If the system-detected GPU differs from its matching GPU configuration, then the |
| GPU is omitted from the node with an error. |
| This allows <i>gres.conf</i> to serve as an optional sanity check and notifies |
| administrators of any unexpected changes in GPU properties. |
| </P> |
| |
| <p>If not all system-detected devices are specified by the slurm.conf |
| configuration, then the relevant slurmd will be drained. However, it is still |
| possible to use a subset of the devices found on the system if they are |
| specified manually (with AutoDetect disabled) in gres.conf. |
| </p> |
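<p>For example, a node with four physical GPUs could expose only two of them by
listing those devices manually with AutoDetect disabled. This is only a sketch;
the type and device paths are illustrative:</p>
<pre>
# Only /dev/nvidia0 and /dev/nvidia1 are offered to Slurm; the slurm.conf
# entry for this node should declare a matching Gres=gpu:gp100:2.
AutoDetect=off
Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1
</pre>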
| |
| <P>Example <I>gres.conf</I> file:</P> |
| <PRE> |
| # Configure four GPUs (with MPS), plus bandwidth |
| AutoDetect=nvml |
| Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1 |
| Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1 |
| Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3 |
| Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3 |
| Name=mps Count=200 File=/dev/nvidia0 |
| Name=mps Count=200 File=/dev/nvidia1 |
| Name=mps Count=100 File=/dev/nvidia2 |
| Name=mps Count=100 File=/dev/nvidia3 |
| Name=bandwidth Type=lustre Count=4G Flags=CountOnly |
| </PRE> |
| |
| <p> In this example, since <i>AutoDetect=nvml</i> is specified, <i>Cores</i> |
| for each GPU will be checked against a corresponding GPU found on the system |
| matching the <i>Type</i> and <i>File</i> specified. |
| Since <i>Links</i> is not specified, it will be automatically filled in |
| according to what is found on the system. |
| If a matching system GPU is not found, no validation takes place and the GPU is |
| assumed to be as the configuration says. |
| </p> |
| |
| <P>For <i>Type</i> to match a system-detected device, it must either exactly |
| match or be a substring of the GPU name reported by slurmd via the AutoDetect |
| mechanism. This GPU name will have all spaces replaced with underscores. To see |
the detected GPUs and their names, run:
<code class="commandline">slurmd -C</code>.</P>
| |
| <PRE> |
| $ slurmd -C |
| NodeName=node0 ... Gres=gpu:geforce_rtx_2060:1 ... |
| Found gpu:geforce_rtx_2060:1 with Autodetect=nvml (Substring of gpu name may be used instead) |
| UpTime=... |
| </PRE> |
| |
<P>In this example, the GPU's name is reported as
<code class="commandline">geforce_rtx_2060</code>. So in your slurm.conf and
gres.conf, the GPU <i>Type</i> can be set to <code class="commandline">
geforce</code>, <code class="commandline">rtx</code>, <code class="commandline">
2060</code>, <code class="commandline">geforce_rtx_2060</code>, or any other
substring, and <b>slurmd</b> should be able to match it to the system-detected
device <code class="commandline">geforce_rtx_2060</code>.</P>

<P>To check your configuration, you may run
<code class="commandline">slurmd -G</code>. This will test and print the GRES
setup based on the current configuration, including any autodetected GRES that
are being ignored.</P>
| |
| <h2 id="Accounting">Accounting |
| <a class="slurm_link" href="#Accounting"></a> |
| </h2> |
| |
| <p>GPU memory and GPU utilization can be tracked as |
| <a href="https://slurm.schedmd.com/tres.html">TRES</a> for tasks using GPU |
| resources. If <code>AccountingStorageTRES=gres/gpu</code> is configured, |
| gres/gpumem and gres/gpuutil will automatically be configured and gathered from |
| GPU jobs. gres/gpumem and gres/gpuutil can also be set individually when |
| gres/gpu is not set.</p> |
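<p>For example, to collect only the GPU memory and utilization TRES without
tracking gres/gpu itself, a slurm.conf could contain the following (a sketch
based on the statement above):</p>
<pre>
AccountingStorageTRES=gres/gpumem,gres/gpuutil
</pre>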
| |
| <p>gres/gpumem and gres/gpuutil are only available for NVIDIA GPUs when using |
| <code>AutoDetect=nvml</code>, and AMD GPUs when using |
| <code>AutoDetect=rsmi</code>.</p> |
| |
| <p>NVML does not support utilization metrics for MIGs, so Slurm does not |
| provide gpumem or gpuutil accounting for MIG devices. See |
| <a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#gpu-utilization-metrics"> |
| NVIDIA's MIG User Guide</a>.</p> |
| |
| <p>Here is an example node with two NVIDIA A100 GPUs:</p> |
| <pre> |
| $ nvidia-smi -L |
| GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-...) |
| GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-...) |
| </pre> |
| |
| <p>The following is an excerpt from an example slurm.conf and gres.conf which |
| will automatically enable tracking for gres/gpumem and gres/gpuutil.</p> |
| |
| <p>slurm.conf:</p> |
| <pre> |
| AccountingStorageTres=gres/gpu |
| NodeName=n1 Gres=gpu:a100:2 |
| </pre> |
| |
| <p>gres.conf:</p> |
| <pre> |
| AutoDetect=nvml |
| </pre> |
| |
| <p>Here's an example of a job with two tasks that uses one GPU per task, two |
| GPUs total.</p> |
| <pre> |
| $ srun --tres-per-task=gres/gpu:1 -n2 --gpus=2 --mem=2G gpu_burn |
| GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-...) |
| GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-...) |
| </pre> |
| |
<p>After the job has finished, we can see utilization details of these TRES
using sacct. Note that there are multiple tasks here, so
TRESUsageIn[Min,Max,Ave,Tot] can be used to examine the "highwater marks" for
gpumem and gpuutil across all the ranks in the step, as with other TRESUsage*
values:</p>
| <pre> |
| $ sacct -j 1277.0 --format=tresusageinave -p |
| TRESUsageInAve| |
| cpu=00:00:11,energy=0,fs/disk=87613,gres/gpumem=36266M,gres/gpuutil=100,mem=628748K,pages=0,vmem=0| |
| $ sacct -j 1277.0 --format=tresusageintot -p |
| TRESUsageInTot| |
| cpu=00:00:22,energy=0,fs/disk=175227,gres/gpumem=72532M,gres/gpuutil=200,mem=1257496K,pages=0,vmem=0| |
| </pre> |
| |
| <h2 id="GPU_Management">GPU Management |
| <a class="slurm_link" href="#GPU_Management"></a> |
| </h2> |
| |
| <P>In the case of Slurm's GRES plugin for GPUs, the environment variable |
| <code class="commandline">CUDA_VISIBLE_DEVICES</code> |
| is set for each job step to determine which GPUs are |
| available for its use on each node. This environment variable is only set |
| when tasks are launched on a specific compute node (no global environment |
| variable is set for the <i>salloc</i> command and the environment variable set |
| for the <i>sbatch</i> command only reflects the GPUs allocated to that job |
| on that node, node zero of the allocation). |
| CUDA version 3.1 (or higher) uses this environment |
| variable in order to run multiple jobs or job steps on a node with GPUs |
| and ensure that the resources assigned to each are unique. In the example |
| above, the allocated node may have four or more graphics devices. In that |
| case, <code class="commandline">CUDA_VISIBLE_DEVICES</code> |
| will reference unique devices for each file and |
| the output might resemble this:</P> |
| |
| <PRE> |
| JobStep=1234.0 CUDA_VISIBLE_DEVICES=0,1 |
| JobStep=1234.1 CUDA_VISIBLE_DEVICES=2 |
| JobStep=1234.2 CUDA_VISIBLE_DEVICES=3 |
| </PRE> |
| |
<p><b>NOTE</b>: Be sure to specify the <I>File</I> parameters in the
<I>gres.conf</I> file and ensure they are in increasing numeric order.</p>
| |
| <p>The <code class="commandline">CUDA_VISIBLE_DEVICES</code> |
| environment variable will also be set in the job's Prolog and Epilog programs. |
| Note that the environment variable set for the job may differ from that set for |
| the Prolog and Epilog if Slurm is configured to constrain the device files |
| visible to a job using Linux cgroup. |
| This is because the Prolog and Epilog programs run <u>outside</u> of any Linux |
| cgroup while the job runs <u>inside</u> of the cgroup and may thus have a |
| different set of visible devices. |
| For example, if a job is allocated the device "/dev/nvidia1", then |
| <code class="commandline">CUDA_VISIBLE_DEVICES</code> will be set to a value of |
| "1" in the Prolog and Epilog while the job's value of |
| <code class="commandline">CUDA_VISIBLE_DEVICES</code> will be set to a |
| value of "0" (i.e. the first GPU device visible to the job). |
| For more information see the |
| <a href="prolog_epilog.html">Prolog and Epilog Guide</a>.</p> |
| |
| <p>When possible, Slurm automatically determines the GPUs on the system using |
| NVML. NVML (which powers the |
| <code class="commandline">nvidia-smi</code> tool) numbers GPUs in order by their |
| PCI bus IDs. For this numbering to match the numbering reported by CUDA, the |
| <code class="commandline">CUDA_DEVICE_ORDER</code> environmental variable must |
| be set to <code class="commandline">CUDA_DEVICE_ORDER=PCI_BUS_ID</code>.</p> |
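<p>For example, in a job script or user environment:</p>
<pre>
# Make CUDA enumerate GPUs by PCI bus ID so its numbering matches NVML/nvidia-smi.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
</pre>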
| |
| <p>GPU device files (e.g. <i>/dev/nvidia1</i>) are |
| based on the Linux minor number assignment, while NVML's device numbers are |
| assigned via PCI bus ID, from lowest to highest. Mapping between these two is |
| nondeterministic and system dependent, and could vary between boots after |
| hardware or OS changes. For the most part, this assignment seems fairly stable. |
| However, an after-bootup check is required to guarantee that a GPU device is |
| assigned to a specific device file.</p> |
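<p>One way to perform such a check, assuming the
<code class="commandline">nvidia-smi</code> utility is installed, is to compare
each GPU's minor number (which determines its /dev/nvidia* file) against its
NVML index and PCI bus ID:</p>
<pre>
# Each row maps an NVML index and PCI bus ID to a /dev/nvidia&lt;minor&gt; device file.
nvidia-smi --query-gpu=index,pci.bus_id,minor_number,name --format=csv
</pre>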
| |
| <p>Please consult the |
| <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars"> |
| NVIDIA CUDA documentation</a> for more information about the |
| <code class="commandline">CUDA_VISIBLE_DEVICES</code> and |
| <code class="commandline">CUDA_DEVICE_ORDER</code> environmental variables.</p> |
| |
| <h2 id="MPS_Management">MPS Management |
| <a class="slurm_link" href="#MPS_Management"></a> |
| </h2> |
| |
| <p><a href="https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf"> |
| CUDA Multi-Process Service (MPS)</a> provides a mechanism where GPUs can be |
| shared by multiple jobs, where each job is allocated some percentage of the |
| GPU's resources. |
| The total count of MPS resources available on a node should be configured in |
| the <I>slurm.conf</I> file (e.g. "NodeName=tux[1-16] Gres=gpu:2,mps:200"). |
| Several options are available for configuring MPS in the <I>gres.conf</I> file |
| as listed below with examples following that:</p> |
| |
| <ol> |
| <li>No MPS configuration: The count of gres/mps elements defined in the |
| <I>slurm.conf</I> will be evenly distributed across all GPUs configured on the |
| node. For example, "NodeName=tux[1-16] Gres=gpu:2,mps:200" will configure |
| a count of 100 gres/mps resources on each of the two GPUs.</li> |
| <li>MPS configuration includes only the <I>Name</I> and <I>Count</I> parameters: |
| The count of gres/mps elements will be evenly distributed across all GPUs |
| configured on the node. This is similar to case 1, but places duplicate |
| configuration in the gres.conf file.</li> |
| <li>MPS configuration includes the <I>Name</I>, <I>File</I> and <I>Count</I> |
| parameters: Each <I>File</I> parameter should identify the device file path of a |
| GPU and the <I>Count</I> should identify the number of gres/mps resources |
| available for that specific GPU device. |
| This may be useful in a heterogeneous environment. |
| For example, some GPUs on a node may be more powerful than others and thus be |
| associated with a higher gres/mps count. |
| Another use case would be to prevent some GPUs from being used for MPS (i.e. |
| they would have an MPS count of zero).</li> |
| </ol> |
| |
| <p>Note that <I>Type</I> and <I>Cores</I> parameters for gres/mps are ignored. |
| That information is copied from the gres/gpu configuration.</p> |
| |
| <p>Note the <I>Count</I> parameter is translated to a percentage, so the value |
| would typically be a multiple of 100.</p> |
| |
| <p>Note that if NVIDIA's NVML library is installed, the GPU configuration |
| (i.e. <I>Type</I>, <I>File</I>, <I>Cores</I> and <I>Links</I> data) will be |
| automatically gathered from the library and need not be recorded in the |
| <I>gres.conf</I> file.</p> |
| |
| <p>By default, job requests for MPS are required to fit on a single gpu on |
| each node. This can be overridden with a flag in the <I>slurm.conf</I> |
| configuration file. See <a href="slurm.conf.html#OPT_MULTIPLE_SHARING_GRES_PJ"> |
| OPT_MULTIPLE_SHARING_GRES_PJ</a>.</p> |
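<p>Assuming the flag is enabled through <i>SelectTypeParameters</i> as described
in the linked slurm.conf entry, a minimal sketch might look like this (the other
parameter values are illustrative):</p>
<pre>
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,MULTIPLE_SHARING_GRES_PJ
</pre>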
| |
| <p>Note the same GPU can be allocated either as a GPU type of GRES or as |
| an MPS type of GRES, but not both. |
| In other words, once a GPU has been allocated as a gres/gpu resource it will |
| not be available as a gres/mps. |
| Likewise, once a GPU has been allocated as a gres/mps resource it will |
| not be available as a gres/gpu. |
However, the same GPU can be allocated as MPS generic resources to multiple jobs
belonging to multiple users, as long as the total count of MPS allocated to
jobs does not exceed the configured count.
Also, since a shared GRES (MPS) cannot be allocated at the same time as a sharing
GRES (GPU), this option only allocates all sharing GRES and no underlying shared
GRES.
| Some example configurations for Slurm's gres.conf file are shown below.</p> |
| |
| <PRE> |
| # Example 1 of gres.conf |
| # Configure four GPUs (with MPS) |
| AutoDetect=nvml |
| Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1 |
| Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1 |
| Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3 |
| Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3 |
| # Set gres/mps Count value to 100 on each of the 4 available GPUs |
| Name=mps Count=400 |
| </PRE> |
| |
| <a id="MPS_config_example_2"></a> |
| <PRE> |
| # Example 2 of gres.conf |
| # Configure four different GPU types (with MPS) |
| AutoDetect=nvml |
| Name=gpu Type=gtx1080 File=/dev/nvidia0 Cores=0,1 |
| Name=gpu Type=gtx1070 File=/dev/nvidia1 Cores=0,1 |
| Name=gpu Type=gtx1060 File=/dev/nvidia2 Cores=2,3 |
| Name=gpu Type=gtx1050 File=/dev/nvidia3 Cores=2,3 |
| Name=mps Count=1300 File=/dev/nvidia0 |
| Name=mps Count=1200 File=/dev/nvidia1 |
| Name=mps Count=1100 File=/dev/nvidia2 |
| Name=mps Count=1000 File=/dev/nvidia3 |
| </PRE> |
| |
| <p><b>NOTE</b>: <i>gres/mps</i> requires the use of the <i>select/cons_tres</i> |
| plugin.</p> |
| |
| <p>Job requests for MPS will be processed the same as any other GRES except |
| that the request must be satisfied using only one GPU per node and only one |
| GPU per node may be configured for use with MPS. |
| For example, a job request for "--gres=mps:50" will not be satisfied by using |
| 20 percent of one GPU and 30 percent of a second GPU on a single node. |
| Multiple jobs from different users can use MPS on a node at the same time. |
| Note that GRES types of GPU <u>and</u> MPS can not be requested within |
| a single job. |
| Also jobs requesting MPS resources can not specify a GPU frequency.</p> |
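<p>For instance, with the node definition shown at the start of this section
("Gres=gpu:2,mps:200", i.e. 100 gres/mps per GPU), a request for half of one
GPU's MPS resources might look like this (the script name is illustrative):</p>
<pre>
# 50 gres/mps = 50% of a single GPU on the allocated node.
sbatch --gres=mps:50 -N1 mps_job.sh
</pre>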
| |
| <p>A prolog program should be used to start and stop MPS servers as needed. |
| A sample prolog script to do this is included with the Slurm distribution in |
| the location <i>etc/prolog.example</i>. |
Its mode of operation is as follows: if a job is allocated gres/mps resources,
then the Prolog will have the <code class="commandline">CUDA_VISIBLE_DEVICES</code>,
<code class="commandline">CUDA_MPS_ACTIVE_THREAD_PERCENTAGE</code>, and
<code class="commandline">SLURM_JOB_UID</code> environment variables set.
| The Prolog should then make sure that an MPS server is started for that GPU |
| and user (UID == User ID). |
| It also records the GPU device ID in a local file. |
| If a job is allocated gres/gpu resources then the Prolog will have the |
| <code class="commandline">CUDA_VISIBLE_DEVICES</code> and |
| <code class="commandline">SLURM_JOB_UID</code> environment variables set |
| (no <code class="commandline">CUDA_MPS_ACTIVE_THREAD_PERCENTAGE</code>). |
| The Prolog should then terminate any MPS server associated with that GPU. |
| It may be necessary to modify this script as needed for the local environment. |
| For more information see the |
| <a href="prolog_epilog.html">Prolog and Epilog Guide</a>.</p> |
| |
| <p>Jobs requesting MPS resources will have the |
| <code class="commandline">CUDA_VISIBLE_DEVICES</code> |
| and <code class="commandline">CUDA_DEVICE_ORDER</code> environment variables set. |
| The device ID is relative to those resources under MPS server control and will |
| always have a value of zero in the current implementation (only one GPU will be |
| usable in MPS mode per node). |
| The job will also have the |
| <code class="commandline">CUDA_MPS_ACTIVE_THREAD_PERCENTAGE</code> |
| environment variable set to that job's percentage of MPS resources available on |
| the assigned GPU. |
The percentage will be calculated based upon the portion of the configured
Count on that GRES that is allocated to the job or step.
For example, a job requesting "--gres=mps:200" and using
<a href="#MPS_config_example_2">configuration example 2</a> above would be
allocated<br>
15% of the gtx1080 (File=/dev/nvidia0, 200 x 100 / 1300 = 15), or<br>
16% of the gtx1070 (File=/dev/nvidia1, 200 x 100 / 1200 = 16), or<br>
18% of the gtx1060 (File=/dev/nvidia2, 200 x 100 / 1100 = 18), or<br>
20% of the gtx1050 (File=/dev/nvidia3, 200 x 100 / 1000 = 20).</p>
| |
| <p>An alternate mode of operation would be to permit jobs to be allocated whole |
| GPUs then trigger the starting of an MPS server based upon comments in the job. |
| For example, if a job is allocated whole GPUs then search for a comment of |
| "mps-per-gpu" or "mps-per-node" in the job (using the "scontrol show job" |
| command) and use that as a basis for starting one MPS daemon per GPU or across |
| all GPUs respectively.</p> |
| |
| <p>Please consult the |
| <a href="https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf"> |
| NVIDIA Multi-Process Service documentation</a> for more information about MPS.</p> |
| |
| <p> |
| Note that a vulnerability exists in previous versions of the NVIDIA driver that |
| may affect users when sharing GPUs. More information can be found in |
| <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-6260"> |
| CVE-2018-6260</a> and in the |
| <a href="https://nvidia.custhelp.com/app/answers/detail/a_id/4772"> |
| Security Bulletin: NVIDIA GPU Display Driver - February 2019</a>.</p> |
| |
| <p>NVIDIA MPS has a built-in limitation regarding GPU sharing among different |
| users. Only one user on a system may have an active MPS server, and the MPS |
| control daemon will queue MPS server activation requests from separate users, |
| leading to serialized exclusive access of the GPU between users (see |
| <a href="https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_1_1"> |
| Section 2.3.1.1 - Limitations</a> in the MPS docs). So different users cannot |
| truly run concurrently on the GPU with MPS; rather, the GPU will be time-sliced |
| between the users (for a diagram depicting this process, see |
| <a href="https://docs.nvidia.com/deploy/mps/index.html#topic_4_3"> |
| Section 3.3 - Provisioning Sequence</a> in the MPS docs).</p> |
| |
| <h2 id="MIG_Management">MIG Management |
| <a class="slurm_link" href="#MIG_Management"></a> |
| </h2> |
| |
<p>Beginning in version 21.08, Slurm supports NVIDIA
<i>Multi-Instance GPU</i> (MIG) devices. This feature allows some newer NVIDIA
| GPUs (like the A100) to split up a GPU into up to seven separate, isolated GPU |
| instances. Slurm can treat these MIG instances as individual GPUs, complete with |
| cgroup isolation and task binding.</p> |
| |
| <p>To configure MIGs in Slurm, specify |
| <code class="commandline">AutoDetect=nvml</code> in <i>gres.conf</i> for the |
| nodes with MIGs, and specify <code class="commandline">Gres</code> |
| in <i>slurm.conf</i> as if the MIGs were regular GPUs, like this: |
| <code class="commandline">NodeName=tux[1-16] gres=gpu:2</code>. An optional |
| GRES type can be specified to distinguish MIGs of different sizes from each |
| other, as well as from other GPUs in the cluster. This type must be a substring |
| of the "MIG Profile" string as reported by the node in its slurmd log under the |
| <code class="commandline">debug2</code> log level. Here is an example slurm.conf |
| for a system with 2 gpus, one of which is partitioned into 2 MIGs where the |
| "MIG Profile" is <code class="commandline">nvidia_a100_3g.20gb</code>:</p> |
| <pre> |
| AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:a100_3g.20gb |
| GresTypes=gpu |
| NodeName=tux[1-16] gres=gpu:a100:1,gpu:a100_3g.20gb:2 |
| </pre> |
| |
| <p>The <a href="gres.conf.html#OPT_MultipleFiles">MultipleFiles</a> parameter |
| allows you to specify multiple device files for the GPU card.</p> |
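<p>A hypothetical manual entry using MultipleFiles for a single MIG instance
might look like the following. The exact parent device and nvidia-caps paths
vary by system, so this is only a sketch; <code class="commandline">AutoDetect=nvml</code>
normally discovers them for you:</p>
<pre>
# A MIG instance is backed by the parent GPU device plus nvidia-caps device files.
Name=gpu Type=a100_3g.20gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
</pre>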
| |
| <p>The sanity-check AutoDetect mode is not supported for MIGs. |
| Slurm expects MIG devices to already be partitioned, and does not support |
| dynamic MIG partitioning.</p> |
| |
| <p>For more information on NVIDIA MIGs (including how to partition them), see |
| <a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html"> |
| the MIG user guide</a>.</p> |
| |
| <h2 id="Sharding">Sharding |
| <a class="slurm_link" href="#Sharding"></a> |
| </h2> |
| |
| <p> |
| Sharding provides a generic mechanism where GPUs can be |
shared by multiple jobs. While it does permit multiple jobs to run on a given
GPU, it does not fence the processes running on the GPU; it only allows the GPU
to be shared. Sharding, therefore, works best with homogeneous workflows. It is
| recommended to limit the number of shards on a node to equal the max possible |
| jobs that can run simultaneously on the node (i.e. cores). |
| The total count of shards available on a node should be configured in |
| the <I>slurm.conf</I> file (e.g. "NodeName=tux[1-16] Gres=gpu:2,shard:64"). |
| Several options are available for configuring shards in the <I>gres.conf</I> file |
| as listed below with examples following that:</p> |
| |
| <ol> |
| <li>No Shard configuration: The count of gres/shard elements defined in the |
| <I>slurm.conf</I> will be evenly distributed across all GPUs configured on the |
| node. For example, "NodeName=tux[1-16] Gres=gpu:2,shard:64" will configure |
| a count of 32 gres/shard resources on each of the two GPUs.</li> |
| <li>Shard configuration includes only the <I>Name</I> and <I>Count</I> parameters: |
| The count of gres/shard elements will be evenly distributed across all GPUs |
| configured on the node. This is similar to case 1, but places duplicate |
| configuration in the gres.conf file.</li> |
| <li>Shard configuration includes the <I>Name</I>, <I>File</I> and <I>Count</I> |
| parameters: Each <I>File</I> parameter should identify the device file path of a |
| GPU and the <I>Count</I> should identify the number of gres/shard resources |
| available for that specific GPU device. |
| This may be useful in a heterogeneous environment. |
| For example, some GPUs on a node may be more powerful than others and thus be |
| associated with a higher gres/shard count. |
| Another use case would be to prevent some GPUs from being used for sharding (i.e. |
| they would have a shard count of zero).</li> |
| </ol> |
| |
| <p>Note that <I>Type</I> and <I>Cores</I> parameters for gres/shard are ignored. |
| That information is copied from the gres/gpu configuration.</p> |
| |
| <p>Note that if NVIDIA's NVML library is installed, the GPU configuration |
| (i.e. <I>Type</I>, <I>File</I>, <I>Cores</I> and <I>Links</I> data) will be |
| automatically gathered from the library and need not be recorded in the |
| <I>gres.conf</I> file.</p> |
| |
| <p>Note the same GPU can be allocated either as a GPU type of GRES or as |
| a shard type of GRES, but not both. |
| In other words, once a GPU has been allocated as a gres/gpu resource it will |
| not be available as a gres/shard. |
| Likewise, once a GPU has been allocated as a gres/shard resource it will |
| not be available as a gres/gpu. |
However, the same GPU can be allocated as shard generic resources to multiple
jobs belonging to multiple users, as long as the total count of shards allocated
to jobs does not exceed the configured count.</p>
| |
| <p>By default, job requests for shards are required to fit on a single gpu on |
| each node. This can be overridden with a flag in the <I>slurm.conf</I> |
| configuration file. See <a href="slurm.conf.html#OPT_MULTIPLE_SHARING_GRES_PJ"> |
| OPT_MULTIPLE_SHARING_GRES_PJ</a>.</p> |
| |
<p>In order for this to be correctly configured, the relevant nodes need to have
<i>shard</i> added as a GRES in their node definitions, and <i>shard</i> must
also be added to the <i>GresTypes</i> parameter. If you want the shards
to be tracked in accounting, then <i>shard</i> also needs to be added to
<i>AccountingStorageTRES</i>.
See the relevant settings in an example slurm.conf:</p>
| <pre> |
| AccountingStorageTRES=gres/gpu,gres/shard |
| GresTypes=gpu,shard |
| NodeName=tux[1-16] Gres=gpu:2,shard:64 |
| </pre> |
| |
| <p>Some example configurations for Slurm's gres.conf file are shown below.</p> |
| |
| <PRE> |
| # Example 1 of gres.conf |
| # Configure four GPUs (with Sharding) |
| AutoDetect=nvml |
| Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1 |
| Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1 |
| Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3 |
| Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3 |
| # Set gres/shard Count value to 8 on each of the 4 available GPUs |
| Name=shard Count=32 |
| </PRE> |
| |
| <a id="Shard_config_example_2"></a> |
| <PRE> |
| # Example 2 of gres.conf |
| # Configure four different GPU types (with Sharding) |
| AutoDetect=nvml |
| Name=gpu Type=gtx1080 File=/dev/nvidia0 Cores=0,1 |
| Name=gpu Type=gtx1070 File=/dev/nvidia1 Cores=0,1 |
| Name=gpu Type=gtx1060 File=/dev/nvidia2 Cores=2,3 |
| Name=gpu Type=gtx1050 File=/dev/nvidia3 Cores=2,3 |
| Name=shard Count=8 File=/dev/nvidia0 |
| Name=shard Count=8 File=/dev/nvidia1 |
| Name=shard Count=8 File=/dev/nvidia2 |
| Name=shard Count=8 File=/dev/nvidia3 |
| </PRE> |
| |
| <p><b>NOTE</b>: <i>gres/shard</i> requires the use of the <i>select/cons_tres</i> |
| plugin.</p> |
| |
| <p>Job requests for shards can not specify a GPU frequency.</p> |
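<p>With the example node definition above ("Gres=gpu:2,shard:64", i.e. 32 shards
per GPU), a request for a quarter of one GPU might look like this (the script
name is illustrative):</p>
<pre>
# 8 of the 32 shards on a single GPU; by default the request must fit on one GPU.
sbatch --gres=shard:8 -N1 shard_job.sh
</pre>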
| |
<p>Jobs requesting shard resources will have the
<code class="commandline">CUDA_VISIBLE_DEVICES</code>, <code class="commandline">ROCR_VISIBLE_DEVICES</code>,
or <code class="commandline">GPU_DEVICE_ORDINAL</code> environment variable set,
just as if they had been allocated the underlying GPU.
</p>
| |
<p>Steps with shards have <code class="commandline">SLURM_SHARDS_ON_NODE</code>
set, indicating the number of shards allocated.</p>
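<p>For example, a step can read that variable directly:</p>
<pre>
$ srun --gres=shard:8 bash -c 'echo "shards on this node: $SLURM_SHARDS_ON_NODE"'
shards on this node: 8
</pre>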
| |
| <p style="text-align: center;">Last modified 10 April 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |