<!--#include virtual="header.txt"-->
<h1>Generic Resource (GRES) Scheduling</h1>
<P>Beginning in SLURM version 2.2, generic resource (GRES) scheduling is
supported through a flexible plugin mechanism. Support is initially provided
for Graphics Processing Units (GPUs), although support for any resource is
possible.</P>
<!-------------------------------------------------------------------------->
<h2>Configuration</h2>
<P>SLURM supports no generic resources in the default configuration.
One must explicitly specify which resources are to be managed in the
<I>slurm.conf</I> configuration file. The configuration parameters of
interest are:</P>
<UL>
<LI><B>GresTypes</B> a comma delimited list of generic resources to be
managed (e.g. <I>GresTypes=gpu,nic</I>). Each name may be that of an
optional plugin providing additional control over the resources.</LI>
<LI><B>Gres</B> the specific generic resources and their counts associated with
each node (e.g. <I>NodeName=linux[0-999] Gres=gpu:8,nic:2</I>); a combined
excerpt is shown below this list.</LI>
</UL>
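<P>Taken together, the two parameters above might appear in <I>slurm.conf</I>
as follows; the node names and counts simply reuse the illustrative values
from the list above:</P>
<PRE>
# Illustrative slurm.conf excerpt enabling GPU and NIC scheduling
GresTypes=gpu,nic
NodeName=linux[0-999] Gres=gpu:8,nic:2
</PRE>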
<P>Note that the Gres specification for each node works in the same fashion
as the other resources managed by SLURM. Depending upon the value of the
<I>FastSchedule</I> parameter, nodes which are found to have fewer resources
than configured will be placed in a DOWN state.</P>
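<P>One way to confirm that a node's generic resources were recognized after
reconfiguring is to inspect its node record. A minimal check is sketched below;
the node name <I>linux0</I> is only an example and the exact output format
varies by SLURM version:</P>
<PRE>
# Show the node record and filter for the Gres field (node name is an example)
scontrol show node linux0 | grep -i gres
</PRE>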
<P>Note that the Gres specification is not supported on BlueGene systems.</P>
<P>Each compute node with generic resources must also contain a <I>gres.conf</I>
file describing which resources are available on the node, their count,
associated device files and CPUs which should be used with those resources.
The configuration parameters available are:</P>
<UL>
<LI><B>Name</B> name of a generic resource (must match <B>GresTypes</B>
values in <I>slurm.conf</I>).</LI>
<LI><B>Count</B> Number of resources of this type available on this node.
The default value is set to the number of <B>File</B> values specified (if any),
otherwise the default value is one. A suffix of "K", "M" or "G" may be used
to multiply the number by 1024, 1048576 or 1073741824 respectively
(e.g. "Count=10G").</LI>
<LI><B>CPUs</B> Specify the CPU index numbers for the specific CPUs which can
use these resources. For example, it may be strongly preferable to use specific
CPUs with specific devices (e.g. on a NUMA architecture).
Multiple CPUs may be specified using a comma delimited list or a range may be
specified using a "-" separator (e.g. "0,1,2,3" or "0-3").
If not specified, then any CPU can be used with the resources.
If any CPU can be used with the resources, then omit the <B>CPUs</B> option
to improve the speed of the SLURM scheduling logic.</LI>
<LI><B>File</B> Fully qualified pathname of the device files associated with a
resource.
The name can include a numeric range suffix to be interpreted by SLURM
(e.g. <I>File=/dev/nvidia[0-3]</I>).
This field is generally required if enforcement of generic resource
allocations is to be supported (i.e. it prevents users from making
use of resources allocated to a different user).
If <B>File</B> is specified then <B>Count</B> must be either set to the number
of file names specified or not set (the default value is the number of files
specified).
NOTE: If you specify the <B>File</B> parameter for a resource on some node,
the option must be specified on all nodes and SLURM will track the assignment
of each specific resource on each node. Otherwise SLURM will only track a
count of allocated resources rather than the state of each individual device
file.</LI>
</UL>
<P>Sample gres.conf file:</P>
<PRE>
# Configure support for our four GPUs
Name=gpu File=/dev/nvidia0 CPUs=0,1
Name=gpu File=/dev/nvidia1 CPUs=0,1
Name=gpu File=/dev/nvidia2 CPUs=2,3
Name=gpu File=/dev/nvidia3 CPUs=2,3
Name=bandwidth Count=20M
</PRE>
<!-------------------------------------------------------------------------->
<h2>Running Jobs</h2>
<P>Jobs will not be allocated any generic resources unless specifically
requested at job submit time using the <I>--gres</I> option supported by
the <I>salloc</I>, <I>sbatch</I> and <I>srun</I> commands. The option
requires an argument specifying which generic resources are required and
how many of them. The resource specification is of the form
<I>name[:count]</I>. The <I>name</I> is the same name as
specified by the <I>GresTypes</I> and <I>Gres</I> configuration parameters.
<I>count</I> specifies how many resources are required and has a default
value of 1. For example:<BR>
<I>sbatch --gres=gpu:2 ...</I>.</P>
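<P>The same option works with each of the three commands. For illustration,
the script and program names below are placeholders and the counts are
arbitrary; the <I>gpu</I> and <I>nic</I> names match the configuration
examples above:</P>
<PRE>
# Batch job requesting two GPUs (my_batch_script.sh is a placeholder)
sbatch --gres=gpu:2 my_batch_script.sh
# Interactive allocation requesting one GPU (count defaults to 1)
salloc --gres=gpu
# Directly launched tasks requesting one NIC (my_program is a placeholder)
srun --gres=nic:1 -n1 my_program
</PRE>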
<P>Jobs will be allocated specific generic resources as needed to satisfy
the request. If the job is suspended, those resources do not become available
for use by other jobs.</P>
<P>Job steps can be allocated generic resources from those allocated to the
job using the <I>--gres</I> option with the <I>srun</I> command as described
above. By default, a job step will be allocated none of the generic resources
allocated to the job; each job step must explicitly request the generic
resources it needs.
This design choice was based upon a scenario where each job executes many
job steps. If job steps were granted access to all generic resources by
default, some job steps would need to explicitly specify zero generic resource
counts, which we considered more confusing. The job step can be allocated
specific generic resources and those resources will not be available to other
job steps. A simple example is shown below.</P>
<PRE>
#!/bin/bash
#
# gres_test.bash
# Submit as follows:
# sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash
#
srun --gres=gpu:2 -n2 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
wait
</PRE>
<!-------------------------------------------------------------------------->
<h2>GPU Management</h2>
<P>In the case of SLURM's GRES plugin for GPUs, the environment variable
CUDA_VISIBLE_DEVICES is set for each job step to determine which GPUs are
available for its use on each node. This environment variable is only set
when tasks are launched on a specific compute node (no global environment
variable is set for the <i>salloc</i> command, and the environment variable set
for the <i>sbatch</i> command only reflects the GPUs allocated to that job
on that node, node zero of the allocation).
CUDA version 3.1 (or higher) uses this environment
variable in order to run multiple jobs or job steps on a node with GPUs
and ensure that the resources assigned to each are unique. In the example
above, the allocated node may have four or more graphics devices. In that
case, CUDA_VISIBLE_DEVICES will reference unique devices for each job step and
the output might resemble this:</P>
<PRE>
JobStep=1234.0 CUDA_VISIBLE_DEVICES=0,1
JobStep=1234.1 CUDA_VISIBLE_DEVICES=2
JobStep=1234.2 CUDA_VISIBLE_DEVICES=3
</PRE>
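<P>The <I>show_device.sh</I> script used in the example above is not part of
SLURM. A minimal sketch that would produce output in roughly this format,
assuming the SLURM_JOB_ID and SLURM_STEP_ID environment variables are available
within the job step, might be:</P>
<PRE>
#!/bin/bash
# Hypothetical show_device.sh: report which GPUs SLURM made visible to this step
echo "JobStep=${SLURM_JOB_ID}.${SLURM_STEP_ID} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
</PRE>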
<P>NOTE: Be sure to specify the <I>File</I> parameters in the <I>gres.conf</I>
file and ensure they are listed in increasing numeric order.</P>
<!-------------------------------------------------------------------------->
<p style="text-align: center;">Last modified 8 May 2012</p>
</body></html>