<!--#include virtual="header.txt"-->
<h1><a name="top">Select Plugin Design Guide</a></h1>
<h2>Overview</h2>
<p>The select plugin is responsible for selecting compute resources to be
allocated to a job, plus allocating and deallocating those resources.
The select plugin is aware of the system's topology, based upon data structures
established by the topology plugin. It can also over-subscribe resources to
support gang scheduling (time slicing of parallel jobs), if so configured.
The select plugin is also capable of communicating with an external entity
to perform these actions (the select/bluegene plugin used on an IBM BlueGene
and the select/cray plugin used with Cray ALPS/BASIL software are two
examples). Other architectures would rely upon either the select/linear or
select/cons_res plugin. The select/linear plugin allocates whole nodes to jobs
and is the simplest implementation. The select/cons_res plugin (<i>cons_res</i>
is an abbreviation for <i>consumable resources</i>) can allocate individual
sockets, cores, threads, or CPUs within a node. The select/cons_res plugin
is slightly slower than select/linear, but contains far more complex logic.</p>
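<p>For example, the plugin is chosen with the <i>SelectType</i> configuration
parameter in <i>slurm.conf</i>, with <i>SelectTypeParameters</i> controlling
the resolution of allocations. A minimal configuration excerpt selecting the
consumable resources plugin might look as follows (the values shown are
illustrative, not recommendations):</p>
<pre>
# slurm.conf excerpt
# Allocate whole nodes to jobs:
#SelectType=select/linear

# Or allocate individual cores and track memory per node:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
</pre>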
<h2>Mode of Operation</h2>
<p>The select/linear and select/cons_res plugins have similar modes of
operation. The obvious difference is that data structures in select/linear
are node-centric, while those in select/cons_res contain information at a
finer resolution (sockets, cores, threads, or CPUs depending upon the
SelectTypeParameters configuration parameter). The description below is
generic and applies to both plugin implementations. Note that both plugins
are able to manage memory allocations. Both plugins are also able to manage
generic resource (GRES) allocations, making use of the GRES plugins.</p>
<p>Per node data structures include memory (configured and allocated),
GRES (configured and allocated, in a List data structure), plus a flag
indicating if the node has been allocated using an exclusive option (preventing
other jobs from being allocated resources on that same node). The other key
data structure is used to enforce the per-partition <i>Shared</i> configuration
parameter and tracks how many jobs have been allocated each resource in each
partition. This data structure is different between the plugins based upon
the resolution of the resource allocation (e.g. nodes or CPUs).</p>
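<p>A sketch of what this per-node and per-partition bookkeeping might look
like is shown below. The structure and field names are illustrative
assumptions made for this guide, not the actual definitions from the plugin
sources:</p>
<pre>
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

typedef void *List;	/* stand-in for SLURM's common List type */

/* Illustrative per-node allocation state (names are assumptions) */
struct node_use_record {
	uint64_t real_memory;	/* configured memory, in MB */
	uint64_t alloc_memory;	/* memory allocated to jobs, in MB */
	List	 gres_list;	/* GRES configured and allocated */
	bool	 exclusive;	/* node allocated exclusively to a job */
};

/* Illustrative per-partition state used to enforce the partition's
 * Shared parameter. The resource being counted is a node for
 * select/linear or (e.g.) a CPU for select/cons_res. */
struct part_use_record {
	char	 *part_name;	/* partition name */
	uint16_t *run_job_cnt;	/* allocated job count per resource */
};
</pre>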
<p>Most of the logic in the select plugin is dedicated to identifying resources
to be allocated to a new job. Input to that function includes: a pointer to the
new job, a bitmap identifying nodes which could be used, node counts (minimum,
maximum, and desired), a count of how many jobs of that partition the job can
share resources with, and a list of jobs which can be preempted to initiate the
new job. The first phase is to determine which of the usable nodes would
best satisfy the resource requirement. This consists of a best-fit
algorithm that groups nodes based upon network topology (if the topology/tree
plugin is configured) or based upon consecutive nodes (by default). Once the
best nodes are identified, resources are accumulated for the new job until its
resource requirements are satisfied.</p>
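<p>The sketch below illustrates the default (consecutive node) case in a
highly simplified form: scan each group of consecutive usable nodes and
select the smallest group that still satisfies the node requirement. This is
a toy model written for this guide, not the plugin's actual algorithm, which
also weighs CPU, memory, GRES and topology information:</p>
<pre>
#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

/* Simplified best fit over consecutive nodes: given a usable-node
 * map, find the smallest run of consecutive usable nodes that still
 * provides req_nodes nodes. Returns the start index of the best
 * run, or -1 if no run is large enough. */
static int best_fit_consecutive(const bool *usable, int node_cnt,
				int req_nodes)
{
	int best_start = -1, best_size = node_cnt + 1;
	int i = 0;

	while (i &lt; node_cnt) {
		if (!usable[i]) {
			i++;
			continue;
		}
		int start = i, size = 0;
		while ((i &lt; node_cnt) &amp;&amp; usable[i]) {
			size++;
			i++;
		}
		/* Prefer the smallest group that is still big enough */
		if ((size &gt;= req_nodes) &amp;&amp; (size &lt; best_size)) {
			best_size = size;
			best_start = start;
		}
	}
	return best_start;
}

int main(void)
{
	bool usable[] = { true, true, false, true, true, true, false };
	printf("allocate starting at node index %d\n",
	       best_fit_consecutive(usable, 7, 2));
	return 0;
}
</pre>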
<p>If the job can not be started with currently available resources, the plugin
will attempt to identify jobs which can be preempted in order to initiate the
new job. A copy of the current system state will be created including details
about all resources and active jobs. Preemptable jobs will then be removed
from this simulated system state until the new job can be initiated. When
sufficient resources are available for the new job, the jobs actually needing
to be preempted for its initiation will be preempted (this may be a subset of
the jobs whose preemption is simulated).</p>
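<p>In outline, that preemption pass might be sketched as follows. The
<i>system_state</i> type and the <i>copy_system_state</i>,
<i>remove_job</i>, <i>job_fits</i> and <i>free_system_state</i> helpers are
hypothetical placeholders for the plugin's internal logic; only the
<i>list_*</i> calls correspond to SLURM's common list API:</p>
<pre>
static List find_preemptees(struct job_record *new_job,
			    List preemptable_jobs)
{
	struct system_state *sim = copy_system_state();
	List preemptees = list_create(NULL);
	ListIterator iter = list_iterator_create(preemptable_jobs);
	struct job_record *job_ptr;

	/* Remove preemptable jobs from the simulated state until the
	 * new job fits; record only the jobs actually removed. */
	while (!job_fits(sim, new_job) &amp;&amp;
	       ((job_ptr = list_next(iter)) != NULL)) {
		remove_job(sim, job_ptr);
		list_append(preemptees, job_ptr);
	}
	list_iterator_destroy(iter);

	if (!job_fits(sim, new_job)) {
		/* Preempting every candidate is still insufficient */
		list_destroy(preemptees);
		preemptees = NULL;
	}
	free_system_state(sim);
	return preemptees;
}
</pre>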
<p>Other functions exist to support suspending jobs, resuming jobs, terminating
jobs, expanding/shrinking job allocations, un/packing job state information,
un/packing node state information, etc. The operation of those functions is
relatively straightforward and not detailed here.</p>
<h2>Operation on IBM BlueGene Systems</h2>
<p>On IBM BlueGene systems, SLURM's <i>slurmd</i> daemon executes on the
front-end nodes rather than the compute nodes and IBM provides a Bridge API
to manage compute nodes and jobs. The IBM BlueGene systems also have very
specific topology rules for what resources can be allocated to a job. SLURM's
interface to IBM's Bridge API and the topology rules are found within the
select/bluegene plugin and very little BlueGene-specific logic in SLURM is
found outside of that plugin. Note that the select/bluegene plugin is used for
BlueGene/L, BlueGene/P and BlueGene/Q systems, with portions of the
code conditionally compiled depending upon the system type.</p>
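<p>Conditional compilation along these lines keeps a single plugin serving
all three system types. The macro names below are illustrative assumptions,
not necessarily those used in the source:</p>
<pre>
#if defined HAVE_BGL
	/* BlueGene/L specific block and wiring rules */
#elif defined HAVE_BGP
	/* BlueGene/P specific block and wiring rules */
#elif defined HAVE_BGQ
	/* BlueGene/Q specific block and wiring rules */
#endif
</pre>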
<h2>Operation on Cray Systems</h2>
<p>The operation of the select/cray plugin is unique in that it does not
directly select resources for a job, but uses the select/linear plugin for
that purpose. It also interfaces with Cray's ALPS software using the BASIL
interface or directly using the database. On Cray systems, SLURM's <i>slurmd</i>
daemon executes on the front-end nodes rather than the compute nodes and
ALPS is the mechanism available for SLURM to manage compute nodes and their
jobs.</p>
<pre>
-------------------
| select/cray |
-------------------
| |
----------------- --------------
| select/linear | | BASIL/ALPS |
----------------- --------------
</pre>
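<p>That layering might be sketched as a thin wrapper: each select/cray entry
point delegates node selection to the corresponding select/linear function,
then mirrors the result into ALPS through the BASIL interface. The
<i>linear_job_test</i> and <i>basil_reserve</i> names below are hypothetical
placeholders, not real SLURM or ALPS interfaces:</p>
<pre>
/* Hypothetical sketch of the delegation diagrammed above */
static int cray_job_test(struct job_record *job_ptr, bitstr_t *bitmap)
{
	/* Let select/linear choose whole nodes for the job */
	int rc = linear_job_test(job_ptr, bitmap);
	if (rc != 0)
		return rc;

	/* Mirror the allocation into ALPS via BASIL so that aprun
	 * can later launch the job on the selected compute nodes */
	return basil_reserve(job_ptr);
}
</pre>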
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 31 May 2011</p>
<!--#include virtual="footer.txt"-->