<!--#include virtual="header.txt"-->
<h1><a name="top">Resource Selection Plugin Programmer Guide</a></h1>
<h2>Overview</h2>
<p>This document describes SLURM resource selection plugins and the API that defines
them. It is intended as a resource to programmers wishing to write their own SLURM
node selection plugins. This is version 100 of the API.</p>
<p>SLURM node selection plugins are SLURM plugins that implement the SLURM node selection
API described herein. They are intended to provide a mechanism for both selecting
nodes for pending jobs and performing any system-specific tasks for job launch or
termination. The plugins must conform to the SLURM Plugin API with the following
specifications:</p>
<p><span class="commandline">const char plugin_type[]</span><br>
The major type must be &quot;select.&quot; The minor type can be any recognizable
abbreviation for the type of node selection algorithm. We recommend, for example:</p>
<ul>
<li><b>linear</b>&#151;A plugin that selects nodes assuming a one-dimensional
array of nodes. The nodes are selected so as to minimize the number of consecutive
sets of nodes utilizing a best-fit algorithm. While supporting shared nodes,
this plugin does not allocate individual processors, but can allocate memory to jobs.
This plugin is recommended for systems without shared nodes.</li>
<li><b>cons_res</b>&#151;A plugin that can allocate individual processors,
memory, etc. within nodes. This plugin is recommended for systems with
many non-parallel programs sharing nodes. For more information see
<a href="cons_res.html">Consumable Resources in SLURM</a>.</li>
<li><b>bluegene</b>&#151;<a href="http://www.research.ibm.com/bluegene/">IBM Blue Gene</a>
node selector. Note that this plugin not only selects the nodes for a job, but performs
some initialization and termination functions for the job.</li>
</ul>
<p>The <span class="commandline">plugin_name</span> and
<span class="commandline">plugin_version</span>
symbols required by the SLURM Plugin API require no specialization for node selection support.
Note carefully, however, the versioning discussion below.</p>
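<p>As a minimal sketch, the identification symbols for a node selection plugin
might be declared as shown below. The minor type <i>example</i>, the descriptive
name string, and the helper <i>is_select_plugin()</i> are hypothetical and used
only for illustration; the version number is the API version stated in this guide.</p>

```c
#include <stdint.h>
#include <string.h>

/* Identification symbols required by the SLURM Plugin API.
 * "example" is a hypothetical minor type used only for illustration. */
const char plugin_name[] = "Example node selection plugin";
const char plugin_type[] = "select/example";
const uint32_t plugin_version = 100;	/* API version stated in this guide */

/* Hypothetical helper: return 1 if a plugin type string carries
 * the "select" major type, 0 otherwise. */
int is_select_plugin(const char *type)
{
	static const char major[] = "select/";

	return strncmp(type, major, sizeof(major) - 1) == 0;
}
```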
<p>A simplified flow of logic follows:</p>
<pre>
/* slurmctld daemon starts, recover state */
if ((<i>select_p_node_init</i>()     != SLURM_SUCCESS) ||
    (<i>select_p_block_init</i>()    != SLURM_SUCCESS) ||
    (<i>select_p_state_restore</i>() != SLURM_SUCCESS) ||
    (<i>select_p_job_init</i>()      != SLURM_SUCCESS))
   abort

/* wait for job arrival */
if (<i>select_p_job_test</i>(all available nodes) != SLURM_SUCCESS) {
   if (<i>select_p_job_test</i>(all configured nodes) != SLURM_SUCCESS)
      /* reject the job and tell the user it can never run */
   else
      /* leave the job queued for later execution */
} else {
   /* update job's node list and node bitmap */
   if (<i>select_p_job_begin</i>() != SLURM_SUCCESS)
      /* leave the job queued for later execution */
   else {
      while (!<i>select_p_job_ready</i>())
         wait
      /* execute the job */
      /* wait for job to end or be terminated */
      <i>select_p_job_fini</i>()
   }
}

/* wait for slurmctld shutdown request */
<i>select_p_state_save</i>()
</pre>
<p>Depending upon failure modes, it is possible that
<span class="commandline">select_p_state_save()</span>
will not be called at slurmctld termination.
When slurmctld is restarted, other function calls may be replayed.
<span class="commandline">select_p_node_init()</span> may be used
to synchronize the plugin's state with that of slurmctld.</p>
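<p>The scheduling decision in the flow above can be sketched as a small
decision function. This is an illustration only: the names
<i>example_schedule</i> and <i>example_outcome</i> are hypothetical, the return
codes of the plugin calls are passed in as plain integers rather than obtained
by calling the plugin, and the SLURM_SUCCESS and SLURM_ERROR values are local
stand-ins for the definitions in the SLURM headers.</p>

```c
/* Local stand-ins for the values defined in the SLURM headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Hypothetical outcomes of one scheduling attempt. */
enum example_outcome {
	EXAMPLE_REJECT,		/* job can never run, tell the user */
	EXAMPLE_QUEUE,		/* leave job queued for later execution */
	EXAMPLE_RUN		/* job may proceed to execution */
};

/* Decide what to do with a job given the return codes of
 * job_test(all available nodes), job_test(all configured nodes)
 * and job_begin(). The last two are ignored when not relevant. */
enum example_outcome example_schedule(int test_available_rc,
				      int test_configured_rc,
				      int job_begin_rc)
{
	if (test_available_rc != SLURM_SUCCESS) {
		if (test_configured_rc != SLURM_SUCCESS)
			return EXAMPLE_REJECT;
		return EXAMPLE_QUEUE;
	}
	if (job_begin_rc != SLURM_SUCCESS)
		return EXAMPLE_QUEUE;
	return EXAMPLE_RUN;
}
```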
<p class="footer"><a href="#top">top</a></p>
<h2>Data Objects</h2>
<p> These functions are expected to read and/or modify data structures directly in
the slurmctld daemon's memory. Slurmctld is a multi-threaded program with independent
read and write locks on each data structure type. Therefore the type of operations
permitted on various data structures is identified for each function.</p>
<p>These functions make use of bitmaps corresponding to the nodes in a table.
The function <span class="commandline">select_p_node_init()</span> should
be used to establish the initial mapping of bitmap entries to nodes.
Functions defined in <i>src/common/bitmap.h</i> should be used for bitmap
manipulations (these functions are directly accessible from the plugin).</p>
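<p>To illustrate the idea of one bit per node table entry, the sketch below uses
a hypothetical 64-bit stand-in type, <i>example_bitmap_t</i>, with hand-written
helpers. It is not the SLURM bitmap implementation; a real plugin would use the
functions from <i>src/common/bitmap.h</i> instead of these stand-ins.</p>

```c
#include <stdint.h>

/* Hypothetical stand-in for a node bitmap covering up to 64 nodes.
 * Bit i corresponds to entry i of the node table. */
typedef uint64_t example_bitmap_t;

/* Mark the node at index node_inx as a member of the set. */
void example_bit_set(example_bitmap_t *b, int node_inx)
{
	*b |= UINT64_C(1) << node_inx;
}

/* Remove the node at index node_inx from the set. */
void example_bit_clear(example_bitmap_t *b, int node_inx)
{
	*b &= ~(UINT64_C(1) << node_inx);
}

/* Return 1 if the node at index node_inx is in the set, 0 otherwise. */
int example_bit_test(example_bitmap_t b, int node_inx)
{
	return (int) ((b >> node_inx) & 1);
}

/* Count the nodes in the set by clearing the lowest set bit each pass. */
int example_bit_count(example_bitmap_t b)
{
	int cnt = 0;

	while (b) {
		b &= b - 1;
		cnt++;
	}
	return cnt;
}
```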
<p class="footer"><a href="#top">top</a></p>
<h2>API Functions</h2>
<p>The following functions must appear. Functions which are not implemented should
be stubbed.</p>
<h3>State Save Functions</h3>
<p class="commandline">int select_p_state_save (char *dir_name);</p>
<p style="margin-left:.2in"><b>Description</b>: Save any global node selection state
information to a file within the specified directory. The actual file name used is plugin specific.
It is recommended that the global node selection state contain a magic number for validation purposes.
This function is called by the slurmctld daemon on shutdown.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<span class="commandline"> dir_name</span>&nbsp;
&nbsp;&nbsp;(input) fully-qualified pathname of a directory into which user SlurmUser (as defined
in slurm.conf) can create a file and write state information into that file. Cannot be NULL.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR.</p>
<p class="commandline">int select_p_state_restore (char *dir_name);</p>
<p style="margin-left:.2in"><b>Description</b>: Restore any global node selection state
information from a file within the specified directory. The actual file name used is plugin specific.
It is recommended that any magic number associated with the global node selection state be verified.
This function is called by the slurmctld daemon on startup.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<span class="commandline"> dir_name</span>&nbsp;
&nbsp;&nbsp;(input) fully-qualified pathname of a directory containing a state information file
from which user SlurmUser (as defined in slurm.conf) can read. Cannot be NULL.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR, causing slurmctld to exit.</p>
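<p>A minimal sketch of saving and restoring state with a magic number follows.
Everything named <i>example_*</i> or <i>EXAMPLE_*</i> is hypothetical, including
the file name, the magic value, and the extra <i>node_cnt</i> parameter, which
stands in for whatever state a real plugin keeps. The SLURM_SUCCESS and
SLURM_ERROR values are local stand-ins for the SLURM header definitions.</p>

```c
#include <stdint.h>
#include <stdio.h>

/* Local stand-ins for the values defined in the SLURM headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

#define EXAMPLE_MAGIC      0x53454c31u		/* arbitrary value */
#define EXAMPLE_STATE_FILE "example_select_state"	/* plugin-specific name */

/* Hypothetical on-disk state: magic number first, then the payload. */
struct example_state {
	uint32_t magic;
	uint32_t node_cnt;
};

/* Build "<dir_name>/<state file>" in buf; fail on NULL or truncation. */
static int example_path(char *buf, size_t len, const char *dir_name)
{
	int n;

	if (dir_name == NULL)
		return SLURM_ERROR;
	n = snprintf(buf, len, "%s/%s", dir_name, EXAMPLE_STATE_FILE);
	return (n < 0 || (size_t) n >= len) ? SLURM_ERROR : SLURM_SUCCESS;
}

/* Write the state, prefixed by the magic number, into dir_name. */
int example_state_save(const char *dir_name, uint32_t node_cnt)
{
	char path[4096];
	struct example_state st = { EXAMPLE_MAGIC, node_cnt };
	FILE *fp;
	int rc = SLURM_SUCCESS;

	if (example_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
		return SLURM_ERROR;
	if ((fp = fopen(path, "wb")) == NULL)
		return SLURM_ERROR;
	if (fwrite(&st, sizeof(st), 1, fp) != 1)
		rc = SLURM_ERROR;
	if (fclose(fp) != 0)
		rc = SLURM_ERROR;
	return rc;
}

/* Read the state back and verify the magic number before using it. */
int example_state_restore(const char *dir_name, uint32_t *node_cnt)
{
	char path[4096];
	struct example_state st;
	FILE *fp;
	int rc = SLURM_SUCCESS;

	if (node_cnt == NULL ||
	    example_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
		return SLURM_ERROR;
	if ((fp = fopen(path, "rb")) == NULL)
		return SLURM_ERROR;
	if (fread(&st, sizeof(st), 1, fp) != 1 || st.magic != EXAMPLE_MAGIC)
		rc = SLURM_ERROR;
	else
		*node_cnt = st.node_cnt;
	fclose(fp);
	return rc;
}
```

<p>Writing the raw structure keeps the sketch short but ties the file to one
machine architecture; a real plugin may prefer a portable packed format.</p>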
<p class="footer"><a href="#top">top</a></p>
<h3>State Initialization Functions</h3>
<p class="commandline">int select_p_node_init (struct node_record *node_ptr, int node_cnt);</p>
<p style="margin-left:.2in"><b>Description</b>: Note the initialization of the node record data
structure. This function is called when the node records are initially established and again
when any nodes are added to or removed from the data structure. </p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<span class="commandline"> node_ptr</span>&nbsp;&nbsp;&nbsp;(input) pointer
to the node data records. Data in these records can read. Nodes deleted after initialization
may have their the <i>name</i> field in the record cleared (zero length) rather than
rebuilding the node records and bitmaps.<br><br>
<span class="commandline"> node_cnt</span>&nbsp; &nbsp;&nbsp;(input) number
of node data records.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR, causing slurmctld to exit.</p>
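<p>The handling of deleted nodes can be sketched as follows. The structure
<i>example_node</i> is a hypothetical stand-in for slurmctld's node record that
models only the <i>name</i> field, and <i>example_node_init()</i> and
<i>example_active_nodes</i> are illustrative names, not part of the API.</p>

```c
#include <stddef.h>

/* Local stand-ins for the values defined in the SLURM headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Hypothetical stand-in for slurmctld's node record;
 * only the name field is modeled here. */
struct example_node {
	const char *name;
};

/* Number of usable (not deleted) nodes seen at the last init call. */
int example_active_nodes = 0;

/* Walk the node table and count entries whose name has not been
 * cleared. Table indexes stay unchanged so that existing bitmaps
 * keep referring to the same nodes. */
int example_node_init(const struct example_node *node_ptr, int node_cnt)
{
	int i, active = 0;

	if (node_ptr == NULL || node_cnt < 0)
		return SLURM_ERROR;
	for (i = 0; i < node_cnt; i++) {
		if (node_ptr[i].name == NULL || node_ptr[i].name[0] == '\0')
			continue;	/* deleted node, skip it */
		active++;
	}
	example_active_nodes = active;
	return SLURM_SUCCESS;
}
```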
<p class="commandline">int select_p_block_init (List block_list);</p>
<p style="margin-left:.2in"><b>Description</b>: Note the initialization of the partition record data
structure. This function is called when the partition records are initially established and again
when any partition configurations change. </p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> block_list</span>&nbsp;&nbsp;&nbsp;(input) list of partition
record entries. Note that some of these partitions may have no associated nodes. Also
consider that nodes can be removed from one partition and added to a different partition.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR, causing slurmctld to exit.</p>
<p class="commandline">int select_p_job_init(List job_list);</p>
<p style="margin-left:.2in"><b>Description</b>: Used at slurm startup to
synchronize plugin (and node) state with that of currently active jobs.</p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> job_list</span>&nbsp; &nbsp;&nbsp;(input)
list of slurm jobs from slurmctld job records.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR.</p>
<p class="footer"><a href="#top">top</a></p>
<h3>State Synchronization Functions</h3>
<p class="commandline">int select_p_update_block (update_part_msg_t *part_desc_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: This function is called when the admin needs
to manually update the state of a block. </p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> part_desc_ptr</span>&nbsp;&nbsp;&nbsp;(input) partition
description variable. Containing the block name and the state to set the block.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR.</p>
<p class="commandline">int select_p_update_nodeinfo(struct node_record *node_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: Update plugin-specific information
related to the specified node. This is called after changes in a node's configuration.</p>
<p style="margin-left:.2in"><b>Argument</b>:
<span class="commandline"> node_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer
to the node for which information is requested.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR.</p>
<p class="commandline">int select_p_update_node_config (int index);</p>
<p style="margin-left:.2in"><b>Description</b>: Note that a node has
registered with a different configuration than previously registered.
For example, the node was configured with 1GB of memory in slurm.conf,
but actually registered with 2GB of memory.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<span class="commandline"> index</span>&nbsp;&nbsp;&nbsp;(input) index
of the node in reference to the entire system.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR.</p>
<p class="commandline">int select_p_update_node_state (int index, uint16_t state);</p>
<p style="margin-left:.2in"><b>Description</b>: Push a change of node state
into the plugin. The index should be the node's index within slurmctld's
node table for the entire system. The state should be the same state to
which the node_record was set in slurmctld.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<span class="commandline"> index</span>&nbsp;&nbsp;&nbsp;(input) index
of the node in reference to the entire system.<br><br>
<span class="commandline"> state</span>&nbsp;&nbsp;&nbsp;(input) new
state of the node.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR</p>
<p class="commandline">int select_p_update_sub_node (update_part_msg_t *part_desc_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: Update the state of a portion of
a SLURM node. Currently used on BlueGene systems to place node cards within a
midplane into or out of an error state.</p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> part_desc_ptr</span>&nbsp;&nbsp;&nbsp;(input) partition
description variable containing the sub-block name and its new state.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR.</p>
<p class="commandline">int select_p_alter_node_cnt (enum
select_node_cnt type, void *data);</p>
<p style="margin-left:.2in"><b>Description</b>: Used for systems such as
BlueGene, where SLURM sees one node where many nodes really exist.
In BlueGene's case, one SLURM node represents 512 nodes in real life, but
since 512 is usually the smallest allocatable block, SLURM handles
it as a single node. This function lets the user specify a 'real'
number, which the function then alters so that SLURM can interpret what the
user means in SLURM terms.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<span class="commandline"> type</span>&nbsp;&nbsp;&nbsp;(input) enum
telling the plugin what kind of conversion the user wants.<br><br>
<span class="commandline"> data</span>&nbsp;&nbsp;&nbsp;(input/output)
a void pointer whose interpretation depends upon the type given in the first
argument. The function adjusts the variable it points to and returns the
result in that same variable.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR</p>
<p class="commandline">int select_p_reconfigure (void);</p>
<p style="margin-left:.2in"><b>Description</b>: Used to notify plugin
of change in partition configuration or general configuration change.
The plugin will test global variables for changes as appropriate.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR</p>
<p class="footer"><a href="#top">top</a></p>
<h3>Job-Specific Functions</h3>
<p class="commandline">int select_p_job_test (struct job_record *job_ptr,
bitstr_t *bitmap, int min_nodes, int max_nodes, int req_nodes, int mode,
List preemption_candidates, List *preempted_jobs);</p>
<p style="margin-left:.2in"><b>Description</b>: Given a job's scheduling requirement
specification and a set of nodes which might be used to satisfy the request, identify
the nodes which "best" satisfy the request. Note that nodes being considered for allocation
to the job may include nodes already allocated to other jobs, even if node sharing is
not permitted. This is done to ascertain whether or not the job may be allocated resources
at some later time (when the other jobs complete). This permits SLURM to reject
non-runnable jobs at submit time rather than after they have spent hours queued.
Informing users of problems at job submission time permits them to quickly resubmit
the job with appropriate constraints.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<span class="commandline"> job_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer
to the job being considered for scheduling. Data in this job record may safely be read.
Data of particular interest include <i>details->contiguous</i> (set if allocated nodes
should be contiguous), <i>num_procs</i> (minimum processors in allocation) and
<i>details->req_node_bitmap</i> (specific required nodes).<br><br>
<span class="commandline"> bitmap</span>&nbsp; &nbsp;&nbsp;(input/output)
bits representing nodes which might be allocated to the job are set on input.
This function should clear the bits representing nodes not required to satisfy
job's scheduling request.
Bits left set will represent nodes to be used for this job. Note that the job's
required nodes (<i>details->req_node_bitmap</i>) will be a subset of
<i>bitmap</i> when the function is called.<br><br>
<span class="commandline"> min_nodes</span>&nbsp; &nbsp;&nbsp;(input)
minimum number of nodes to allocate to this job. Note this reflects both job
and partition specifications.<br><br>
<span class="commandline"> max_nodes</span>&nbsp; &nbsp;&nbsp;(input)
maximum number of nodes to allocate to this job. Note this reflects both job
and partition specifications.<br><br>
<span class="commandline"> req_nodes</span>&nbsp; &nbsp;&nbsp;(input)
the requested (desired) number of nodes to allocate to this job. This reflects job's
maximum node specification (if supplied).<br><br>
<span class="commandline"> mode</span>&nbsp; &nbsp;&nbsp;(input)
controls the mode of operation. Valid options are:<br>
SELECT_MODE_RUN_NOW: try to schedule job now<br>
SELECT_MODE_TEST_ONLY: test if job can ever run<br>
SELECT_MODE_WILL_RUN: determine when and where job can run<br><br>
<span class="commandline"> preemption_candidates</span>&nbsp; &nbsp;&nbsp;(input)
list of pointers to jobs which may be preempted in order to initiate this
pending job. May be NULL if there are no preemption candidates.<br><br>
<span class="commandline"> preempted_jobs</span>&nbsp; &nbsp;&nbsp;(input/output)
list of jobs which must be preempted in order to initiate the pending job.
If the value is NULL, no job list is returned.
If the list pointed to has a value of NULL, a new list will be created;
otherwise the existing list will be overwritten.
Use the <i>list_destroy</i> function to destroy the list when no longer
needed.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and future attempts may be made to schedule
the job.</p>
<p class="commandline">int select_p_job_begin (struct job_record *job_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: Note that the initiation of the specified job
is about to begin. This function is called immediately after
<span class="commandline">select_p_job_test()</span> successfully completes for this job.</p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> job_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer
to the job being initialized. Data in this job record may safely be read or written.
The <i>nodes</i> and <i>node_bitmap</i> fields of this job record identify the
nodes which have already been selected for this job to use. For an example of
a job record field that the plugin may write into, see <i>select_id</i>.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR, which causes the job to be requeued for
later execution.</p>
<p class="commandline">int select_p_job_ready (struct job_record *job_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: Test if resources are configured
and ready for job execution. This function is only used in the job prolog for
BlueGene systems to determine if the bglblock has been booted and is ready for use.</p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> job_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer
to the job being initialized. Data in this job record may safely be read.
The <i>nodes</i> and <i>node_bitmap</i> fields of this job record identify the
nodes which have already been selected for this job to use. </p>
<p style="margin-left:.2in"><b>Returns</b>: 1 if the job may begin execution,
0 otherwise.</p>
<p class="commandline">int select_p_job_fini (struct job_record *job_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: Note the termination of the
specified job. This function is called as the termination process for the
job begins (prior to killing the tasks).</p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> job_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer
to the job being terminated. Data in this job record may safely be read or written.
The <i>nodes</i> and/or <i>node_bitmap</i> fields of this job record identify the
nodes which were selected for this job to use.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR.</p>
<p class="commandline">int select_p_job_suspend (struct job_record *job_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: Suspend the specified job.
Release resources for use by other jobs.</p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> job_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer
to the job being suspended. Data in this job record may safely be read or
written. The <i>nodes</i> and/or <i>node_bitmap</i> fields of this job record
identify the nodes which were selected for this job to use.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On
failure, the plugin should return a SLURM error code.</p>
<p class="commandline">int select_p_job_resume (struct job_record *job_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: Resume the specified job
which was previously suspended.</p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline"> job_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer
to the job being resumed. Data in this job record may safely be read or
written. The <i>nodes</i> and/or <i>node_bitmap</i> fields of this job record
identify the nodes which were selected for this job to use.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On
failure, the plugin should return a SLURM error code.</p>
<p class="footer"><a href="#top">top</a></p>
<h3>Get Information Functions</h3>
<p class="commandline">int select_p_get_info_from_plugin(enum select_data_info info,
struct job_record *job_ptr, void *data);</p>
<p style="margin-left:.2in"><b>Description</b>: Get plugin-specific information
about a job.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<span class="commandline"> info</span>&nbsp; &nbsp;&nbsp;(input) identifies
the type of data to be updated.<br><br>
<span class="commandline"> job_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer to
the job related to the query (if applicable; may be NULL).<br><br>
<span class="commandline"> data</span>&nbsp; &nbsp;&nbsp;(output) the requested data.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR.</p>
<p class="commandline">int select_p_pack_node_info (time_t last_query_time, Buf *buffer_ptr);</p>
<p style="margin-left:.2in"><b>Description</b>: Pack node specific information into a buffer.</p>
<p style="margin-left:.2in"><b>Arguments</b>:
<span class="commandline">
last_query_time</span>&nbsp;&nbsp;&nbsp;(input) time that the data was
last saved.<br>
<span class="commandline"> buffer_ptr</span>&nbsp;&nbsp;&nbsp;(input/output) buffer into
which the node data is appended.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful,
SLURM_NO_CHANGE_IN_DATA if data has not changed since last packed, otherwise SLURM_ERROR.</p>
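<p>The change-detection behavior can be sketched as a timestamp comparison. The
structure <i>example_buf</i> is a hypothetical stand-in for SLURM's buffer type,
the <i>example_*</i> names are illustrative, and the numeric return codes are
local stand-ins; a real plugin takes them from the SLURM headers. The assumption
here is that the plugin records the time at which its node data last changed.</p>

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Local stand-ins; real values come from the SLURM headers. */
#define SLURM_SUCCESS            0
#define SLURM_ERROR             -1
#define SLURM_NO_CHANGE_IN_DATA  1900

/* Hypothetical stand-in for SLURM's buffer type. */
struct example_buf {
	char data[256];
	size_t used;
};

static time_t example_last_update;	/* when node data last changed */
static char example_node_data[64];	/* plugin's current node data */

/* Record new node data together with the time of the change. */
void example_set_node_data(const char *text, time_t now)
{
	strncpy(example_node_data, text, sizeof(example_node_data) - 1);
	example_node_data[sizeof(example_node_data) - 1] = '\0';
	example_last_update = now;
}

/* Append node data to the buffer only if it changed after the
 * caller's last query. */
int example_pack_node_info(time_t last_query_time, struct example_buf *buf)
{
	size_t len;

	if (buf == NULL)
		return SLURM_ERROR;
	if (example_last_update <= last_query_time)
		return SLURM_NO_CHANGE_IN_DATA;
	len = strlen(example_node_data);
	if (buf->used + len > sizeof(buf->data))
		return SLURM_ERROR;	/* would overflow the buffer */
	memcpy(buf->data + buf->used, example_node_data, len);
	buf->used += len;
	return SLURM_SUCCESS;
}
```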
<p class="commandline">int select_p_get_select_nodeinfo(struct node_record *node_ptr,
enum select_data_info info, void *data);</p>
<p style="margin-left:.2in"><b>Description</b>: Get plugin-specific information
related to the specified node.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<span class="commandline"> node_ptr</span>&nbsp; &nbsp;&nbsp;(input) pointer
to the node for which information is requested.<br><br>
<span class="commandline"> info</span>&nbsp; &nbsp;&nbsp;(input) identifies
the type of data requested.<br><br>
<span class="commandline"> data</span>&nbsp; &nbsp;&nbsp;(output) the requested data.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR.</p>
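<p>Returning typed data through a <i>void *</i> argument selected by an enum can
be sketched as follows. The enum <i>example_data_info</i>, its values, and the
<i>example_node</i> structure are hypothetical and do not reflect the contents
of <i>enum select_data_info</i> or the node record; the point is only that the
caller and the plugin must agree on the data type for each <i>info</i> value.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the values defined in the SLURM headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Hypothetical information types, for illustration only. */
enum example_data_info {
	EXAMPLE_NODE_CPUS,	/* data points to a uint16_t */
	EXAMPLE_NODE_MEMORY_MB	/* data points to a uint32_t */
};

/* Hypothetical plugin-specific data kept for one node. */
struct example_node {
	uint16_t alloc_cpus;
	uint32_t alloc_memory_mb;
};

/* Copy the requested item into the variable that data points to. */
int example_get_nodeinfo(const struct example_node *node_ptr,
			 enum example_data_info info, void *data)
{
	if (node_ptr == NULL || data == NULL)
		return SLURM_ERROR;
	switch (info) {
	case EXAMPLE_NODE_CPUS:
		*(uint16_t *) data = node_ptr->alloc_cpus;
		break;
	case EXAMPLE_NODE_MEMORY_MB:
		*(uint32_t *) data = node_ptr->alloc_memory_mb;
		break;
	default:
		return SLURM_ERROR;	/* unknown information type */
	}
	return SLURM_SUCCESS;
}
```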
<p class="footer"><a href="#top">top</a></p>
<h2>Versioning</h2>
<p> This document describes version 1 of the SLURM node selection API. Future
releases of SLURM may revise this API. A node selection plugin conveys its ability
to implement a particular API version using the mechanism outlined for SLURM plugins.
In addition, the credential is transmitted along with the version number of the
plugin that transmitted it. It is at the discretion of the plugin author whether
to maintain data format compatibility across different versions of the plugin.</p>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 5 October 2009</p>
<!--#include virtual="footer.txt"-->