| <!--#include virtual="header.txt"--> |
| |
| <h1>Generic Resource (GRES) Scheduling</h1> |
| |
| <P>Generic resource (GRES) scheduling is supported through a flexible plugin |
| mechanism. Support is currently provided for Graphics Processing Units (GPUs) |
| and Intel® Many Integrated Core (MIC) processors.</P> |
| |
| <!--------------------------------------------------------------------------> |
| <h2>Configuration</h2> |
| |
<P>SLURM manages no generic resources in its default configuration.
One must explicitly specify which resources are to be managed in the
<I>slurm.conf</I> configuration file. The configuration parameters of
interest are:</P>
| |
| <UL> |
<LI><B>GresTypes</B> a comma delimited list of generic resources to be
managed (e.g. <I>GresTypes=gpu,mic</I>). Each name may be that of an
optional plugin providing additional control over the resources.</LI>
<LI><B>Gres</B> the specific generic resources and their counts associated with
each node (e.g. <I>NodeName=linux[0-999] Gres=gpu:1,mic:2</I>).</LI>
| </UL> |
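
<P>Putting these together, a minimal <I>slurm.conf</I> excerpt might look
like this (the node names and counts are illustrative):</P>
<PRE>
# Excerpt from slurm.conf
GresTypes=gpu,mic
NodeName=linux[0-999] Gres=gpu:1,mic:2
</PRE>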
| |
<P>Note that the Gres specification for each node works in the same fashion
as any other resource managed by SLURM. Depending upon the value of the
<I>FastSchedule</I> parameter, nodes which are found to have fewer resources
than configured will be placed in a DOWN state.</P>
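
<P>A node's state and the reason it was set DOWN can be inspected with
<I>scontrol</I> (the node name is illustrative):</P>
<PRE>
# Inspect a node whose gres.conf may report fewer resources
# than slurm.conf configures
scontrol show node linux0
</PRE>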
| |
| <P>Note that the Gres specification is not supported on BlueGene systems.</P> |
| |
| <P>Each compute node with generic resources must also contain a <I>gres.conf</I> |
| file describing which resources are available on the node, their count, |
| associated device files and CPUs which should be used with those resources. |
| The configuration parameters available are:</P> |
| |
| <UL> |
| <LI><B>Name</B> name of a generic resource (must match <B>GresTypes</B> |
values in <I>slurm.conf</I>).</LI>
| |
| <LI><B>Count</B> Number of resources of this type available on this node. |
| The default value is set to the number of <B>File</B> values specified (if any), |
| otherwise the default value is one. A suffix of "K", "M" or "G" may be used |
| to multiply the number by 1024, 1048576 or 1073741824 respectively |
| (e.g. "Count=10G"). Note that Count is a 32-bit field and the maximum value |
| is 4,294,967,295.</LI> |
| |
<LI><B>CPUs</B> Specify the CPU index numbers for the specific CPUs which can
use these resources. For example, it may be strongly preferable to use specific
CPUs with specific devices (e.g. on a NUMA architecture).
Multiple CPUs may be specified using a comma delimited list or a range may be
specified using a "-" separator (e.g. "0,1,2,3" or "0-3").
If not specified, then any CPU can be used with the resources.
If any CPU can be used with the resources, do not specify the
CPUs option; omitting it improves the speed of the SLURM scheduling logic.</LI>
| |
| <LI><B>File</B> Fully qualified pathname of the device files associated with a |
| resource. |
The name can include a numeric range suffix to be interpreted by SLURM
| (e.g. <I>File=/dev/nvidia[0-3]</I>). |
| This field is generally required if enforcement of generic resource |
allocations is to be supported (i.e. it prevents a user from making
use of resources allocated to a different user).
| If File is specified then Count must be either set to the number |
| of file names specified or not set (the default value is the number of files |
| specified). |
| NOTE: If you specify the File parameter for a resource on some node, |
| the option must be specified on all nodes and SLURM will track the assignment |
| of each specific resource on each node. Otherwise SLURM will only track a |
| count of allocated resources rather than the state of each individual device |
| file.</LI> |
| </UL> |
| |
| <P>Sample gres.conf file:</P> |
| <PRE> |
| # Configure support for our four GPUs |
| Name=gpu File=/dev/nvidia0 CPUs=0,1 |
| Name=gpu File=/dev/nvidia1 CPUs=0,1 |
| Name=gpu File=/dev/nvidia2 CPUs=2,3 |
| Name=gpu File=/dev/nvidia3 CPUs=2,3 |
| Name=bandwidth Count=20M |
| </PRE> |
| <!--------------------------------------------------------------------------> |
| <h2>Running Jobs</h2> |
| |
| <P>Jobs will not be allocated any generic resources unless specifically |
| requested at job submit time using the <I>--gres</I> option supported by |
| the <I>salloc</I>, <I>sbatch</I> and <I>srun</I> commands. The option |
requires an argument specifying which generic resources are required and
in what quantity. The resource specification is of the form
| <I>name[:count]</I>. The <I>name</I> is the same name as |
| specified by the <I>GresTypes</I> and <I>Gres</I> configuration parameters. |
| <I>count</I> specifies how many resources are required and has a default |
| value of 1. For example:<BR> |
| <I>sbatch --gres=gpu:2 ...</I>.</P> |
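
<P>A few more illustrative invocations of the <I>--gres</I> option (the batch
script name is a placeholder):</P>
<PRE>
# Request one GPU per node (count defaults to 1)
salloc --gres=gpu

# Request two MICs per node for a batch job
sbatch --gres=mic:2 my_script.bash

# Multiple resource types may be requested as a comma delimited list
srun --gres=gpu:1,mic:2 ...
</PRE>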
| |
| <P>Jobs will be allocated specific generic resources as needed to satisfy |
| the request. If the job is suspended, those resources do not become available |
| for use by other jobs.</P> |
| |
| <P>Job steps can be allocated generic resources from those allocated to the |
| job using the <I>--gres</I> option with the <I>srun</I> command as described |
above. By default, a job step will be allocated none of the generic resources
allocated to the job. If desired, the job step may explicitly specify a
different generic resource count than the job.
| This design choice was based upon a scenario where each job executes many |
| job steps. If job steps were granted access to all generic resources by |
| default, some job steps would need to explicitly specify zero generic resource |
| counts, which we considered more confusing. The job step can be allocated |
| specific generic resources and those resources will not be available to other |
| job steps. A simple example is shown below.</P> |
| |
| <PRE> |
| #!/bin/bash |
| # |
| # gres_test.bash |
| # Submit as follows: |
| # sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash |
| # |
| srun --gres=gpu:2 -n2 --exclusive show_device.sh & |
| srun --gres=gpu:1 -n1 --exclusive show_device.sh & |
| srun --gres=gpu:1 -n1 --exclusive show_device.sh & |
| wait |
| </PRE> |
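
<P>The <I>show_device.sh</I> script is not part of SLURM. A minimal sketch of
such a helper, assuming the SLURM_JOB_ID and SLURM_STEP_ID output environment
variables set by <I>srun</I>, might be:</P>
<PRE>
#!/bin/bash
# show_device.sh (hypothetical helper): report which GPUs this
# job step can see through the CUDA_VISIBLE_DEVICES variable
echo "JobStep=$SLURM_JOB_ID.$SLURM_STEP_ID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
</PRE>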
| |
| <!--------------------------------------------------------------------------> |
| <h2>GPU Management</h2> |
| |
| <P>In the case of SLURM's GRES plugin for GPUs, the environment variable |
| CUDA_VISIBLE_DEVICES is set for each job step to determine which GPUs are |
available for its use on each node. This environment variable is only set
when tasks are launched on a specific compute node (no global environment
variable is set for the <i>salloc</i> command, and the environment variable set
for the <i>sbatch</i> command reflects only the GPUs allocated to that job
on that node, node zero of the allocation).
CUDA version 3.1 (or higher) uses this environment
variable to run multiple jobs or job steps on a node with GPUs
and ensure that the resources assigned to each are unique. In the example
| above, the allocated node may have four or more graphics devices. In that |
| case, CUDA_VISIBLE_DEVICES will reference unique devices for each file and |
| the output might resemble this:</P> |
| |
| <PRE> |
| JobStep=1234.0 CUDA_VISIBLE_DEVICES=0,1 |
| JobStep=1234.1 CUDA_VISIBLE_DEVICES=2 |
| JobStep=1234.2 CUDA_VISIBLE_DEVICES=3 |
| </PRE> |
| |
<P>NOTE: Be sure to specify the <I>File</I> parameters in the <I>gres.conf</I>
file and ensure they are in increasing numeric order.</P>
| <!--------------------------------------------------------------------------> |
| <h2>MIC Management</h2> |
| |
| <P>SLURM can be used to provide resource management for systems with the |
| Intel® Many Integrated Core (MIC) processor. |
| SLURM sets an OFFLOAD_DEVICES environment variable, which controls the |
| selection of MICs available to a job step. |
| The OFFLOAD_DEVICES environment variable is used by both Intel |
LEO (Language Extensions for Offload) and the MKL (Math Kernel Library)
| automatic offload. |
| (This is very similar to how the CUDA_VISIBLE_DEVICES environment variable is |
| used to control which GPUs can be used by CUDA™ software.) |
| If no MICs are reserved via GRES, the OFFLOAD_DEVICES variable is set to |
| -1. This causes the code to ignore the offload directives and run MKL |
| routines on the CPU. The code will still run but only on the CPU. This |
| also gives a somewhat cryptic warning:</P> |
| <pre>offload warning: OFFLOAD_DEVICES device number -1 does not correspond |
| to a physical device</pre> |
<P>Offloading is automatically scaled to all of the allocated devices (e.g. if
<I>--gres=mic:2</I> is specified, then all offloads use two MICs unless
devices are explicitly specified in the offload pragmas).</P>
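
<P>To confirm which MICs a job step has been allocated, the variable can be
printed from within the step (a sketch; the device numbers shown are
illustrative):</P>
<PRE>
# Print the MICs visible to a single-task job step
srun --gres=mic:2 -n1 printenv OFFLOAD_DEVICES
0,1
</PRE>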
| <!--------------------------------------------------------------------------> |
| |
| <p style="text-align: center;">Last modified 25 October 2012</p> |
| |
| </body></html> |