| <!--#include virtual="header.txt"--> |
| |
| <h1>Generic Resource (GRES) Scheduling</h1> |
| |
<P>Beginning in SLURM version 2.2, generic resource (Gres) scheduling is
supported through a flexible plugin mechanism. Support is initially provided
for Graphics Processing Units (GPUs), although support for any resource is
possible.</P>
| |
| <!--------------------------------------------------------------------------> |
| <h2>Configuration</h2> |
| |
<P>SLURM supports no generic resources in the default configuration.
| One must explicitly specify which resources are to be managed in the |
| <I>slurm.conf</I> configuration file. The configuration parameters of |
| interest are:</P> |
| |
| <UL> |
<LI><B>GresTypes</B> a comma-delimited list of generic resources to be
managed (e.g. <I>GresTypes=gpu,nic</I>). This name may be that of an
optional plugin providing additional control over the resources.</LI>
<LI><B>Gres</B> the specific generic resources and their counts associated with
each node (e.g. <I>NodeName=linux[0-999] Gres=gpu:8,nic:2</I>); see the
sample excerpt below.</LI>
| </UL> |
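
<P>For example, a minimal <I>slurm.conf</I> excerpt combining these two
parameters might look like the following (the node names and resource counts
are only illustrative):</P>
<PRE>
# slurm.conf excerpt (illustrative values)
GresTypes=gpu,nic
NodeName=linux[0-999] Gres=gpu:8,nic:2
</PRE>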
| |
<P>Note that the Gres specification for each node works in the same fashion
as the other resources managed by SLURM. Depending upon the value of the
| <I>FastSchedule</I> parameter, nodes which are found to have fewer resources |
| than configured will be placed in a DOWN state.</P> |
| |
| <P>Note that the Gres specification is not supported on BlueGene systems.</P> |
| |
<P>Each compute node with generic resources must also contain a <I>gres.conf</I>
file describing which resources are available on the node, their count, the
associated device files, and the CPUs that should be used with those resources.
The configuration parameters available are:</P>
| |
| <UL> |
<LI><B>Name</B> name of a generic resource (must match <B>GresTypes</B>
values in <I>slurm.conf</I>).</LI>
| |
| <LI><B>Count</B> Number of resources of this type available on this node. |
| The default value is set to the number of <B>File</B> values specified (if any), |
| otherwise the default value is one. A suffix of "K", "M" or "G" may be used |
to multiply the number by 1024, 1048576 or 1073741824 respectively
| (e.g. "Count=10G"). Note that Count is a 32-bit field and the maximum value |
| is 4,294,967,295.</LI> |
| |
<LI><B>CPUs</B> Specify the CPU index numbers for the specific CPUs which can
use these resources. For example, it may be strongly preferable to use specific
CPUs with specific devices (e.g. on a NUMA architecture).
Multiple CPUs may be specified using a comma-delimited list or a range may be
specified using a "-" separator (e.g. "0,1,2,3" or "0-3").
If not specified, then any CPU can be used with the resources.
If any CPU can be used with the resources, then do not specify the
<B>CPUs</B> option for improved speed in the SLURM scheduling logic.</LI>
| |
| <LI><B>File</B> Fully qualified pathname of the device files associated with a |
| resource. |
The name can include a numeric range suffix to be interpreted by SLURM
(e.g. <I>File=/dev/nvidia[0-3]</I>).
This field is generally required if enforcement of generic resource
allocations is to be supported (i.e. it prevents a user from making
use of resources allocated to a different user).
| If <B>File</B> is specified then <B>Count</B> must be either set to the number |
| of file names specified or not set (the default value is the number of files |
| specified). |
| NOTE: If you specify the <B>File</B> parameter for a resource on some node, |
| the option must be specified on all nodes and SLURM will track the assignment |
| of each specific resource on each node. Otherwise SLURM will only track a |
| count of allocated resources rather than the state of each individual device |
| file.</LI> |
| </UL> |
| |
| <P>Sample gres.conf file:</P> |
| <PRE> |
| # Configure support for our four GPUs |
| Name=gpu File=/dev/nvidia0 CPUs=0,1 |
| Name=gpu File=/dev/nvidia1 CPUs=0,1 |
| Name=gpu File=/dev/nvidia2 CPUs=2,3 |
| Name=gpu File=/dev/nvidia3 CPUs=2,3 |
| Name=bandwidth Count=20M |
| </PRE> |
| <!--------------------------------------------------------------------------> |
| <h2>Running Jobs</h2> |
| |
| <P>Jobs will not be allocated any generic resources unless specifically |
| requested at job submit time using the <I>--gres</I> option supported by |
| the <I>salloc</I>, <I>sbatch</I> and <I>srun</I> commands. The option |
requires an argument specifying which generic resources are required and
how many of them. The resource specification is of the form
| <I>name[:count]</I>. The <I>name</I> is the same name as |
| specified by the <I>GresTypes</I> and <I>Gres</I> configuration parameters. |
| <I>count</I> specifies how many resources are required and has a default |
| value of 1. For example:<BR> |
| <I>sbatch --gres=gpu:2 ...</I>.</P> |
| |
| <P>Jobs will be allocated specific generic resources as needed to satisfy |
| the request. If the job is suspended, those resources do not become available |
| for use by other jobs.</P> |
| |
| <P>Job steps can be allocated generic resources from those allocated to the |
| job using the <I>--gres</I> option with the <I>srun</I> command as described |
above. By default, a job step will be allocated none of the generic resources
allocated to the job; the desired generic resources must be explicitly
requested for each job step.
| This design choice was based upon a scenario where each job executes many |
| job steps. If job steps were granted access to all generic resources by |
| default, some job steps would need to explicitly specify zero generic resource |
| counts, which we considered more confusing. The job step can be allocated |
| specific generic resources and those resources will not be available to other |
| job steps. A simple example is shown below.</P> |
| |
| <PRE> |
| #!/bin/bash |
| # |
| # gres_test.bash |
| # Submit as follows: |
| # sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash |
| # |
| srun --gres=gpu:2 -n2 --exclusive show_device.sh & |
| srun --gres=gpu:1 -n1 --exclusive show_device.sh & |
| srun --gres=gpu:1 -n1 --exclusive show_device.sh & |
| wait |
| </PRE> |
| |
| <!--------------------------------------------------------------------------> |
| <h2>GPU Management</h2> |
| |
| <P>In the case of SLURM's GRES plugin for GPUs, the environment variable |
| CUDA_VISIBLE_DEVICES is set for each job step to determine which GPUs are |
| available for its use on each node. This environment variable is only set |
| when tasks are launched on a specific compute node (no global environment |
| variable is set for the <i>salloc</i> command and the environment variable set |
| for the <i>sbatch</i> command only reflects the GPUs allocated to that job |
| on that node, node zero of the allocation). |
| CUDA version 3.1 (or higher) uses this environment |
| variable in order to run multiple jobs or job steps on a node with GPUs |
and ensure that the resources assigned to each are unique. In the example
| above, the allocated node may have four or more graphics devices. In that |
case, CUDA_VISIBLE_DEVICES will reference unique devices for each job step
and the output might resemble this:</P>
| |
| <PRE> |
| JobStep=1234.0 CUDA_VISIBLE_DEVICES=0,1 |
| JobStep=1234.1 CUDA_VISIBLE_DEVICES=2 |
| JobStep=1234.2 CUDA_VISIBLE_DEVICES=3 |
| </PRE> |
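
<P>The <I>show_device.sh</I> script referenced in the batch script above is
not part of SLURM. A minimal sketch that could produce output in the format
shown, relying only on the SLURM_JOB_ID, SLURM_STEP_ID and SLURM_PROCID
environment variables set by <I>srun</I>, might be:</P>
<PRE>
#!/bin/bash
#
# show_device.sh (illustrative sketch)
# Print this job step's ID and the GPUs made visible to it.
# Only task zero of each step prints, so each step produces one line.
if [ "$SLURM_PROCID" = "0" ]; then
    echo "JobStep=$SLURM_JOB_ID.$SLURM_STEP_ID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
fi
</PRE>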
| |
<P>NOTE: Be sure to specify the <I>File</I> parameters in the <I>gres.conf</I>
file and ensure they are in increasing numeric order.</P>
| <!--------------------------------------------------------------------------> |
| |
| <p style="text-align: center;">Last modified 2 July 2012</p> |
| |
| </body></html> |