| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">Select Plugin Design Guide</a></h1> |
| |
| <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2> |
| |
| <p>The select plugin is responsible for selecting compute resources to be |
| allocated to a job, plus allocating and deallocating those resources. |
| The select plugin is aware of the system's topology, based upon data structures |
| established by the topology plugin. It can also over-subscribe resources to |
| support gang scheduling (time slicing of parallel jobs), if so configured. |
| Two implementations are described here: the select/linear |
| and select/cons_tres plugins. The select/linear plugin allocates |
| whole nodes to jobs and is the simplest implementation. |
| The select/cons_tres plugin (<i>cons_tres</i> is an abbreviation for |
| <i>consumable trackable resources</i>) can allocate individual sockets, cores, threads |
| or CPUs within a node. It also includes the ability to manage other generic |
| resources, such as GPUs. |
| The select/cons_tres plugin is slightly slower than |
| select/linear and contains far more complex logic.</p> |
| |
| <h2 id="mode">Mode of Operation<a class="slurm_link" href="#mode"></a></h2> |
| |
| <p>The select/linear and select/cons_tres plugins have |
| similar modes of operation. The obvious difference is that data structures |
| in select/linear are node-centric, while those in |
| select/cons_tres contain information at a finer resolution (sockets, cores, |
| threads, or CPUs depending upon the SelectTypeParameters configuration |
| parameter). The description below is generic and applies to the above two |
| plugin implementations. Note that each of these plugins is able to manage |
| memory allocations. If you need to track other resources, such as GPUs, |
| you should use the select/cons_tres plugin.</p> |
| |
| <p>Per node data structures include memory (configured and allocated), |
| GRES (configured and allocated, in a List data structure), plus a flag |
| indicating if the node has been allocated using an exclusive option (preventing |
| other jobs from being allocated resources on that same node). The other key |
| data structure is used to enforce the per-partition <i>OverSubscribe</i> |
| configuration parameter and tracks how many jobs have been allocated each |
| compute resource (e.g. CPU) in each |
| partition. This data structure is different between the plugins based upon |
| the resolution of the resource allocation (e.g. nodes or CPUs).</p> |
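| |
| <p>As a rough illustration of the two kinds of records described above, the |
| following Python sketch models a per-node record and a per-partition usage |
| counter. It is not Slurm's implementation (which is written in C); the class |
| names, field names, and the <i>max_share</i> limit are hypothetical and chosen |
| only to mirror the prose.</p> |

```python
from dataclasses import dataclass, field


@dataclass
class NodeUsage:
    """Hypothetical per-node record; field names are illustrative only."""
    real_memory: int                                 # configured memory (MB)
    alloc_memory: int = 0                            # memory allocated to jobs (MB)
    gres: dict = field(default_factory=dict)         # configured GRES, e.g. {"gpu": 4}
    gres_alloc: dict = field(default_factory=dict)   # allocated GRES counts
    exclusive: bool = False                          # a job holds the node exclusively


@dataclass
class PartitionUsage:
    """Hypothetical per-partition record: jobs allocated to each resource."""
    max_share: int                                   # assumed limit derived from OverSubscribe
    jobs_per_resource: list = field(default_factory=list)

    def can_use(self, idx: int) -> bool:
        # A resource is usable while fewer than max_share jobs hold it.
        return self.jobs_per_resource[idx] < self.max_share

    def allocate(self, idx: int) -> None:
        if not self.can_use(idx):
            raise ValueError(f"resource {idx} is fully subscribed")
        self.jobs_per_resource[idx] += 1

    def release(self, idx: int) -> None:
        self.jobs_per_resource[idx] -= 1
```

| <p>In this sketch, a node-centric plugin would index <i>jobs_per_resource</i> |
| by node, while a finer-grained plugin would index it by core or CPU.</p> |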
| |
| <p>Most of the logic in the select plugin is dedicated to identifying resources |
| to be allocated to a new job. Input to that function includes: a pointer to the |
| new job, a bitmap identifying nodes which could be used, node counts (minimum, |
| maximum, and desired), a count of how many jobs of that partition the job can |
| share resources with, and a list of jobs which can be preempted to initiate the |
| new job. The first phase is to determine which of the usable nodes |
| would best satisfy the resource requirement. This consists of a best-fit |
| algorithm that groups nodes based upon network topology (if the topology/tree |
| plugin is configured) or based upon consecutive nodes (by default). Once the |
| best nodes are identified, resources are accumulated for the new job until its |
| resource requirements are satisfied.</p> |
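| |
| <p>The default case (grouping by consecutive nodes) can be sketched in Python |
| as follows. This is a simplified illustration, not the plugin's actual code: |
| the function names are hypothetical, the node bitmap is a plain list of |
| booleans, the requirement is reduced to a single CPU count, and the sketch |
| assumes the job must fit within one run of consecutive nodes.</p> |

```python
def consecutive_runs(available):
    """Group a node bitmap (list of bools) into (start, length) runs of usable nodes."""
    runs, start = [], None
    for i, usable in enumerate(available):
        if usable and start is None:
            start = i
        elif not usable and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(available) - start))
    return runs


def pick_nodes(available, cpus_per_node, cpus_needed):
    """Best fit: choose the smallest run that still satisfies the request."""
    best = None
    for start, length in consecutive_runs(available):
        total = sum(cpus_per_node[start:start + length])
        if total >= cpus_needed and (best is None or total < best[0]):
            best = (total, start, length)
    if best is None:
        return None  # no single run is large enough in this simplified model
    _, start, length = best
    # Accumulate nodes from the chosen run until the requirement is met.
    chosen, got = [], 0
    for i in range(start, start + length):
        chosen.append(i)
        got += cpus_per_node[i]
        if got >= cpus_needed:
            break
    return chosen
```

| <p>Preferring the smallest sufficient run leaves larger runs of consecutive |
| nodes intact for later jobs, which is the usual motivation for a best-fit |
| policy.</p> |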
| |
| <p>If the job cannot be started with currently available resources, the plugin |
| will attempt to identify jobs which can be preempted in order to initiate the |
| new job. A copy of the current system state will be created including details |
| about all resources and active jobs. Preemptable jobs will then be removed |
| from this simulated system state until the new job can be initiated. When |
| sufficient resources are available for the new job, the jobs actually needing |
| to be preempted for its initiation will be preempted (this may be a subset of |
| the jobs whose preemption is simulated).</p> |
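| |
| <p>The preemption logic above can be sketched in Python under strong |
| simplifying assumptions: system state is reduced to a count of idle CPUs and a |
| map from job to CPUs held, and the function name and arguments are hypothetical. |
| The second pass shows one way that the jobs actually preempted may end up being |
| a subset of those removed during simulation.</p> |

```python
import copy


def plan_preemption(free_cpus, running, preemptable, cpus_needed):
    """Return the job ids that must be preempted, or None if the job cannot start.

    free_cpus:   number of idle CPUs (simplified system state)
    running:     dict mapping job id -> CPUs held
    preemptable: job ids that may be preempted, in the order to try them
    """
    sim_free = free_cpus
    sim_running = copy.deepcopy(running)   # simulate on a copy, not the live state
    removed = []
    for job_id in preemptable:
        if sim_free >= cpus_needed:
            break
        sim_free += sim_running.pop(job_id)
        removed.append(job_id)
    if sim_free < cpus_needed:
        return None
    # Second pass: restore any removed job whose resources are not actually needed.
    needed = []
    for job_id in removed:
        if sim_free - running[job_id] >= cpus_needed:
            sim_free -= running[job_id]    # put this job back in the simulation
        else:
            needed.append(job_id)
    return needed
```

| <p>For example, with 2 idle CPUs, running jobs holding 1, 4, and 2 CPUs, and a |
| request for 6 CPUs, the simulation removes the first two jobs but only the |
| 4-CPU job needs to be preempted.</p> |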
| |
| <p>Other functions exist to support suspending jobs, resuming jobs, terminating |
| jobs, shrinking job allocations, un/packing job state information, |
| un/packing node state information, etc. The operation of those functions is |
| relatively straightforward and not detailed here.</p> |
| |
| <p style="text-align:center;">Last modified 29 January 2024</p> |
| |
| <!--#include virtual="footer.txt"--> |