| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">Resource Selection Plugin Programmer Guide</a></h1> |
| |
| <h2>Overview</h2> |
| <p>This document describes SLURM resource selection plugins and the API that defines |
| them. It is intended as a resource to programmers wishing to write their own SLURM |
| node selection plugins. This is version 100 of the API.</p> |
| |
| <p>SLURM node selection plugins are SLURM plugins that implement the SLURM node selection |
| API described herein. They are intended to provide a mechanism for both selecting |
| nodes for pending jobs and performing any system-specific tasks for job launch or |
| termination. The plugins must conform to the SLURM Plugin API with the following |
| specifications:</p> |
| <p><span class="commandline">const char plugin_type[]</span><br> |
| The major type must be "select". The minor type can be any recognizable |
| abbreviation for the type of node selection algorithm. We recommend, for example:</p> |
| <ul> |
| <li><b>linear</b>—A plugin that selects nodes assuming a one-dimensional |
| array of nodes. The nodes are selected so as to minimize the number of consecutive |
| sets of nodes utilizing a best-fit algorithm. While supporting shared nodes, |
| this plugin does not allocate individual processors, but can allocate memory to jobs. |
| This plugin is recommended for systems without shared nodes.</li> |
| <li><b>cons_res</b>—A plugin that can allocate individual processors, |
| memory, etc. within nodes. This plugin is recommended for systems with |
| many non-parallel programs sharing nodes. For more information see |
| <a href="cons_res.html">Consumable Resources in SLURM</a>.</li> |
| <li><b>bluegene</b>—<a href="http://www.research.ibm.com/bluegene/">IBM Blue Gene</a> |
| node selector. Note that this plugin not only selects the nodes for a job, but performs |
| some initialization and termination functions for the job.</li> |
| </ul> |
| <p>The <span class="commandline">plugin_name</span> and |
| <span class="commandline">plugin_version</span> |
| symbols required by the SLURM Plugin API require no specialization for node selection support. |
| Note carefully, however, the versioning discussion below.</p> |
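<p>As a minimal sketch of the identification symbols described above, the
following declares them for a hypothetical plugin named "example". The plugin
name string, the minor type and the helper <i>is_select_plugin()</i> are
illustrative assumptions; the version value is simply the API version quoted
in the overview.</p>

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical plugin: only the "select" major type comes from the text. */
const char plugin_name[] = "Example node selection plugin";
const char plugin_type[] = "select/example";

/* Assumption: the API version quoted in the overview of this guide. */
const uint32_t plugin_version = 100;

/* Hypothetical helper (not part of SLURM): returns 1 if a plugin_type
 * string carries the "select" major type, 0 otherwise. */
int is_select_plugin(const char *type)
{
    static const char major[] = "select/";

    return type != NULL && strncmp(type, major, sizeof(major) - 1) == 0;
}
```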
| |
| <p>A simplified flow of logic follows:</p> |
| <pre> |
| /* slurmctld daemon starts, recover state */ |
| if ((<i>select_p_node_init</i>() != SLURM_SUCCESS) || |
|     (<i>select_p_block_init</i>() != SLURM_SUCCESS) || |
|     (<i>select_p_state_restore</i>() != SLURM_SUCCESS) || |
|     (<i>select_p_job_init</i>() != SLURM_SUCCESS)) |
|    abort |
| |
| /* wait for job arrival */ |
| if (<i>select_p_job_test</i>(all available nodes) != SLURM_SUCCESS) { |
|    if (<i>select_p_job_test</i>(all configured nodes) != SLURM_SUCCESS) |
|       /* reject the job and tell the user it can never run */ |
|    else |
|       /* leave the job queued for later execution */ |
| } else { |
|    /* update job's node list and node bitmap */ |
|    if (<i>select_p_job_begin</i>() != SLURM_SUCCESS) |
|       /* leave the job queued for later execution */ |
|    else { |
|       while (!<i>select_p_job_ready</i>()) |
|          wait |
|       /* execute the job */ |
|       /* wait for job to end or be terminated */ |
|       <i>select_p_job_fini</i>() |
|    } |
| } |
| |
| /* wait for slurmctld shutdown request */ |
| <i>select_p_state_save</i>() |
| </pre> |
| <p>Depending upon failure modes, it is possible that |
| <span class="commandline">select_p_state_save()</span> |
| will not be called at slurmctld termination. |
| When slurmctld is restarted, other function calls may be replayed. |
| <span class="commandline">select_p_node_init()</span> may be used |
| to synchronize the plugin's state with that of slurmctld.</p> |
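<p>The per-job decision in the flow above can be sketched in plain C. This is
an illustration only, not slurmctld code: the <i>select_ops</i> structure, the
<i>job_outcome</i> enum and the stub functions are hypothetical stand-ins for
the real plugin calls, which take job records and node bitmaps.</p>

```c
#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Hypothetical outcome codes for this sketch. */
enum job_outcome { JOB_REJECTED, JOB_QUEUED, JOB_STARTED };

/* Hypothetical stand-in for the plugin entry points used in the flow.
 * job_test takes 0 to test available nodes, 1 to test configured nodes. */
struct select_ops {
    int (*job_test)(int all_configured_nodes);
    int (*job_begin)(void);
};

/* Mirrors the pseudocode: test available nodes, fall back to a test
 * against all configured nodes, then try to begin the job. */
enum job_outcome schedule_one_job(const struct select_ops *ops)
{
    if (ops->job_test(0) != SLURM_SUCCESS) {
        if (ops->job_test(1) != SLURM_SUCCESS)
            return JOB_REJECTED;   /* the job can never run */
        return JOB_QUEUED;         /* the job may run later */
    }
    if (ops->job_begin() != SLURM_SUCCESS)
        return JOB_QUEUED;         /* leave queued for later execution */
    return JOB_STARTED;
}

/* Example stubs standing in for a plugin. */
int fits_now(int all)   { (void)all; return SLURM_SUCCESS; }
int fits_later(int all) { return all ? SLURM_SUCCESS : SLURM_ERROR; }
int fits_never(int all) { (void)all; return SLURM_ERROR; }
int begin_ok(void)      { return SLURM_SUCCESS; }
int begin_fail(void)    { return SLURM_ERROR; }
```

<p>For instance, a plugin whose test fails on the available nodes but succeeds
on all configured nodes leaves the job queued rather than rejecting it.</p>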
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>Data Objects</h2> |
| <p> These functions are expected to read and/or modify data structures directly in |
| the slurmctld daemon's memory. Slurmctld is a multi-threaded program with independent |
| read and write locks on each data structure type. Therefore the type of operations |
| permitted on various data structures is identified for each function.</p> |
| |
| <p>These functions make use of bitmaps corresponding to the nodes in a table. |
| The function <span class="commandline">select_p_node_init()</span> should |
| be used to establish the initial mapping of bitmap entries to nodes. |
| Functions defined in <i>src/common/bitstring.h</i> should be used for bitmap |
| manipulations (these functions are directly accessible from the plugin).</p> |
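<p>To illustrate the idea of a bitmap whose entries correspond to node table
entries, here is a small self-contained sketch. It deliberately does not use
the SLURM bitmap functions: the <i>node_bitmap_t</i> type and the
<i>nb_*</i> helpers are hypothetical stand-ins limited to 64 nodes, and a real
plugin should use the functions provided by SLURM instead.</p>

```c
#include <stdint.h>

/* Hypothetical stand-in: bit i corresponds to entry i of the node table.
 * Limited to 64 nodes to keep the sketch short. */
typedef uint64_t node_bitmap_t;

/* Mark node i as a member of the set. */
void nb_set(node_bitmap_t *b, int i)   { *b |= (UINT64_C(1) << i); }

/* Remove node i from the set. */
void nb_clear(node_bitmap_t *b, int i) { *b &= ~(UINT64_C(1) << i); }

/* Return 1 if node i is in the set, 0 otherwise. */
int nb_test(node_bitmap_t b, int i)    { return (int)((b >> i) & 1); }

/* Count the nodes in the set. */
int nb_count(node_bitmap_t b)
{
    int n = 0;

    while (b) {
        b &= b - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```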
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>API Functions</h2> |
| <p>The following functions must appear. Functions which are not implemented should |
| be stubbed.</p> |
| |
| <h3>State Save Functions</h3> |
| |
| <p class="commandline">int select_p_state_save (char *dir_name);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Save any global node selection state |
| information to a file within the specified directory. The actual file name used is plugin specific. |
| It is recommended that the global node selection state contain a magic number for validation purposes. |
| This function is called by the slurmctld daemon on shutdown.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<span class="commandline"> dir_name</span> |
| (input) fully-qualified pathname of a directory into which user SlurmUser (as defined |
| in slurm.conf) can create a file and write state information into that file. Cannot be NULL.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR.</p> |
| |
| <p class="commandline">int select_p_state_restore (char *dir_name);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Restore any global node selection state |
| information from a file within the specified directory. The actual file name used is plugin specific. |
| It is recommended that any magic number associated with the global node selection state be verified. |
| This function is called by the slurmctld daemon on startup.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<span class="commandline"> dir_name</span> |
| (input) fully-qualified pathname of a directory containing a state information file |
| from which user SlurmUser (as defined in slurm.conf) can read. Cannot be NULL.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR, causing slurmctld to exit.</p> |
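<p>The save/restore pair and the recommended magic number could look roughly
like the sketch below. The state structure, the magic value and the file name
<i>example_select_state</i> are hypothetical, and the state is written as a raw
struct for brevity, which a real plugin may want to replace with a portable
format.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

#define EXAMPLE_STATE_MAGIC 0x53454c31u            /* hypothetical value */
#define EXAMPLE_STATE_FILE  "example_select_state" /* hypothetical name */

/* Hypothetical global state kept by the plugin. */
struct example_state {
    uint32_t magic;
    uint32_t node_cnt;
};

struct example_state g_state = { EXAMPLE_STATE_MAGIC, 0 };

/* Build "<dir_name>/<state file>" and fail if it does not fit. */
static int build_path(char *buf, size_t len, const char *dir_name)
{
    int n = snprintf(buf, len, "%s/%s", dir_name, EXAMPLE_STATE_FILE);

    return (n < 0 || (size_t)n >= len) ? SLURM_ERROR : SLURM_SUCCESS;
}

int select_p_state_save(char *dir_name)
{
    char path[4096];
    FILE *fp;
    int rc = SLURM_SUCCESS;

    assert(dir_name != NULL);          /* documented as "Cannot be NULL" */
    if (build_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "wb")) == NULL)
        return SLURM_ERROR;
    g_state.magic = EXAMPLE_STATE_MAGIC;
    if (fwrite(&g_state, sizeof(g_state), 1, fp) != 1)
        rc = SLURM_ERROR;
    if (fclose(fp) != 0)
        rc = SLURM_ERROR;
    return rc;
}

int select_p_state_restore(char *dir_name)
{
    char path[4096];
    struct example_state tmp;
    FILE *fp;
    size_t got;

    assert(dir_name != NULL);
    if (build_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "rb")) == NULL)
        return SLURM_ERROR;
    got = fread(&tmp, sizeof(tmp), 1, fp);
    fclose(fp);
    /* Reject short reads and files that lack the expected magic number. */
    if (got != 1 || tmp.magic != EXAMPLE_STATE_MAGIC)
        return SLURM_ERROR;
    g_state = tmp;
    return SLURM_SUCCESS;
}
```

<p>Note that this sketch treats a missing state file as a failure. Since a
failed restore causes slurmctld to exit, a plugin author may prefer to treat a
missing file on first startup as success; that choice is left open here.</p>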
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h3>State Initialization Functions</h3> |
| |
| <p class="commandline">int select_p_node_init (struct node_record *node_ptr, int node_cnt);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Note the initialization of the node record data |
| structure. This function is called when the node records are initially established and again |
| when any nodes are added to or removed from the data structure. </p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <span class="commandline"> node_ptr</span> (input) pointer |
| to the node data records. Data in these records can be read. When nodes are deleted after |
| initialization, the <i>name</i> field in their records may be cleared (zero length) rather than |
| the node records and bitmaps being rebuilt.<br><br> |
| <span class="commandline"> node_cnt</span> (input) number |
| of node data records.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR, causing slurmctld to exit.</p> |
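<p>A minimal sketch of how a plugin might walk the node records while skipping
deleted entries follows. The <i>node_record</i> structure shown here is a
trimmed, hypothetical stand-in that models only the <i>name</i> field, and the
<i>active_node_cnt</i> variable is an assumed piece of plugin-private state.</p>

```c
#include <stddef.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Trimmed, hypothetical stand-in for slurmctld's node record; only the
 * name field mentioned in the text is modelled. */
struct node_record {
    const char *name;
};

/* Assumed plugin-private state: number of nodes that are still in use. */
int active_node_cnt;

int select_p_node_init(struct node_record *node_ptr, int node_cnt)
{
    int i;

    if (node_ptr == NULL || node_cnt < 0)
        return SLURM_ERROR;

    active_node_cnt = 0;
    for (i = 0; i < node_cnt; i++) {
        /* A deleted node keeps its slot (and bitmap index) but has a
         * zero-length name, so it is skipped rather than counted. */
        if (node_ptr[i].name == NULL || node_ptr[i].name[0] == '\0')
            continue;
        active_node_cnt++;
    }
    return SLURM_SUCCESS;
}
```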
| |
| <p class="commandline">int select_p_block_init (List part_list);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Note the initialization of the partition record data |
| structure. This function is called when the partition records are initially established and again |
| when any partition configurations change. </p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> part_list</span> (input) list of partition |
| record entries. Note that some of these partitions may have no associated nodes. Also |
| consider that nodes can be removed from one partition and added to a different partition.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR, causing slurmctld to exit.</p> |
| |
| <p class="commandline">int select_p_job_init(List job_list);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Used at SLURM startup to |
| synchronize plugin (and node) state with that of currently active jobs.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> job_list</span> (input) |
| list of slurm jobs from slurmctld job records.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR.</p> |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h3>State Synchronization Functions</h3> |
| |
| <p class="commandline">int select_p_update_block (update_part_msg_t *part_desc_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: This function is called when the admin needs |
| to manually update the state of a block. </p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> part_desc_ptr</span> (input) partition |
| description variable containing the block name and the state to which the block should be set.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR.</p> |
| |
| <p class="commandline">int select_p_update_nodeinfo(struct node_record *node_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Update plugin-specific information |
| related to the specified node. This is called after changes in a node's configuration.</p> |
| <p style="margin-left:.2in"><b>Argument</b>: |
| <span class="commandline"> node_ptr</span> (input) pointer |
| to the node whose information is to be updated.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR.</p> |
| |
| <p class="commandline">int select_p_update_node_config (int index);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Note that a node has |
| registered with a different configuration than previously registered. |
| For example, the node was configured with 1GB of memory in slurm.conf, |
| but actually registered with 2GB of memory.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <span class="commandline"> index</span> (input) index |
| of the node in reference to the entire system.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR.</p> |
| |
| <p class="commandline">int select_p_update_node_state (int index, uint16_t state);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Push a change of node state |
| into the plugin. The index should be the node's index within slurmctld's |
| table of all nodes in the system. The state should be the same state to |
| which the node_record was set in slurmctld.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <span class="commandline"> index</span> (input) index |
| of the node in reference to the entire system.<br><br> |
| <span class="commandline"> state</span> (input) new |
| state of the node.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR</p> |
| |
| <p class="commandline">int select_p_update_sub_node (update_part_msg_t *part_desc_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Update the state of a portion of |
| a SLURM node. Currently used on BlueGene systems to place node cards within a |
| midplane into or out of an error state.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> part_desc_ptr</span> (input) partition |
| description variable containing the sub-block name and its new state.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR</p> |
| |
| <p class="commandline">int select_p_alter_node_cnt (enum |
| select_node_cnt type, void *data);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Used for systems such as |
| BlueGene, where SLURM sees one node where many nodes really |
| exist. On BlueGene, one SLURM node represents 512 nodes in real life, but |
| since 512 is usually the smallest allocatable block, SLURM handles |
| it as one node. This function lets the user specify a 'real' node |
| count, which the function converts into the value that SLURM uses |
| internally.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <span class="commandline"> type</span> (input) enum |
| telling the plugin which conversion the user is requesting.<br><br> |
| <span class="commandline"> data</span> (input/output) |
| a void pointer whose interpretation depends on <span class="commandline">type</span>. |
| The function adjusts the variable it points to so that it returns what the user is asking for.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR</p> |
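<p>The conversion between a 'real' node count and a SLURM node count might be
sketched as follows. The two enum values and the choice of
<i>uint32_t</i> as the type behind <i>data</i> are hypothetical and chosen for
this illustration only; the factor of 512 comes from the BlueGene example in
the description.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* From the BlueGene example: one SLURM node stands for 512 real nodes. */
#define EXAMPLE_NODES_PER_UNIT 512u

/* Hypothetical request types; the real enum values are not listed in
 * this guide. In both cases data is assumed to point to a uint32_t. */
enum select_node_cnt {
    EXAMPLE_GET_NODE_SCALING,   /* out: real nodes per SLURM node */
    EXAMPLE_REAL_TO_SLURM_CNT   /* in: real count, out: SLURM count */
};

int select_p_alter_node_cnt(enum select_node_cnt type, void *data)
{
    uint32_t *cnt = data;

    if (cnt == NULL)
        return SLURM_ERROR;

    switch (type) {
    case EXAMPLE_GET_NODE_SCALING:
        *cnt = EXAMPLE_NODES_PER_UNIT;
        return SLURM_SUCCESS;
    case EXAMPLE_REAL_TO_SLURM_CNT:
        /* Round up so that a partial unit still occupies a whole node. */
        *cnt = *cnt / EXAMPLE_NODES_PER_UNIT +
               (*cnt % EXAMPLE_NODES_PER_UNIT != 0);
        return SLURM_SUCCESS;
    }
    return SLURM_ERROR;   /* unknown request type */
}
```

<p>Under these assumptions, a request for 1024 real nodes becomes 2 SLURM
nodes, and a request for 513 is rounded up to 2 as well.</p>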
| |
| <p class="commandline">int select_p_reconfigure (void);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Used to notify plugin |
| of change in partition configuration or general configuration change. |
| The plugin will test global variables for changes as appropriate.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, otherwise SLURM_ERROR</p> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h3>Job-Specific Functions</h3> |
| |
| <p class="commandline">int select_p_job_test (struct job_record *job_ptr, |
| bitstr_t *bitmap, int min_nodes, int max_nodes, int req_nodes, int mode, |
| List preemption_candidates, List *preempted_jobs);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Given a job's scheduling requirement |
| specification and a set of nodes which might be used to satisfy the request, identify |
| the nodes which "best" satisfy the request. Note that nodes being considered for allocation |
| to the job may include nodes already allocated to other jobs, even if node sharing is |
| not permitted. This is done to ascertain whether or not the job may be allocated resources |
| at some later time (when the other jobs complete). This permits SLURM to reject |
| non-runnable jobs at submit time rather than after they have spent hours queued. |
| Informing users of problems at job submission time permits them to quickly resubmit |
| the job with appropriate constraints.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <span class="commandline"> job_ptr</span> (input) pointer |
| to the job being considered for scheduling. Data in this job record may safely be read. |
| Data of particular interest include <i>details->contiguous</i> (set if allocated nodes |
| should be contiguous), <i>num_procs</i> (minimum processors in allocation) and |
| <i>details->req_node_bitmap</i> (specific required nodes).<br><br> |
| <span class="commandline"> bitmap</span> (input/output) |
| bits representing nodes which might be allocated to the job are set on input. |
| This function should clear the bits representing nodes not required to satisfy |
| job's scheduling request. |
| Bits left set will represent nodes to be used for this job. Note that the job's |
| required nodes (<i>details->req_node_bitmap</i>) will be a subset of |
| <i>bitmap</i> when the function is called.<br><br> |
| <span class="commandline"> min_nodes</span> (input) |
| minimum number of nodes to allocate to this job. Note this reflects both job |
| and partition specifications.<br><br> |
| <span class="commandline"> max_nodes</span> (input) |
| maximum number of nodes to allocate to this job. Note this reflects both job |
| and partition specifications.<br><br> |
| <span class="commandline"> req_nodes</span> (input) |
| the requested (desired) number of nodes to allocate to this job. This reflects the job's |
| maximum node specification (if supplied).<br><br> |
| <span class="commandline"> mode</span> (input) |
| controls the mode of operation. Valid options are:<br> |
| SELECT_MODE_RUN_NOW: try to schedule job now<br> |
| SELECT_MODE_TEST_ONLY: test if job can ever run<br> |
| SELECT_MODE_WILL_RUN: determine when and where job can run<br><br> |
| <span class="commandline"> preemption_candidates</span> (input) |
| list of pointers to jobs which may be preempted in order to initiate this |
| pending job. May be NULL if there are no preemption candidates.<br><br> |
| <span class="commandline"> preempted_jobs</span> (input/output) |
| list of jobs which must be preempted in order to initiate the pending job. |
| If the value is NULL, no job list is returned. |
| If the list pointed to has a value of NULL, a new list will be created; |
| otherwise the existing list will be overwritten. |
| Use the <i>list_destroy</i> function to destroy the list when no longer |
| needed.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR and future attempts may be made to schedule |
| the job.</p> |
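<p>The input/output contract of the <i>bitmap</i> argument can be illustrated
with a deliberately simplified sketch. It is not the real function: the name
<i>example_job_test</i>, the reduced argument list and the use of
<i>bool</i> arrays in place of SLURM bitmaps are assumptions, and the selection
is a plain first-fit rather than the best-fit algorithm of the linear plugin.
It also assumes that <i>min_nodes</i> does not exceed <i>max_nodes</i>.</p>

```c
#include <stdbool.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1

/* Simplified stand-in for the job test (hypothetical signature).
 * bitmap[i]   : on input, true if node i may be allocated to the job;
 *               on successful return, true only for the nodes selected.
 * required[i] : true if the job specifically requires node i. */
int example_job_test(bool *bitmap, const bool *required, int node_cnt,
                     int min_nodes, int max_nodes)
{
    int i, avail = 0, picked = 0;

    for (i = 0; i < node_cnt; i++) {
        if (bitmap[i])
            avail++;
        else if (required[i])
            return SLURM_ERROR;   /* a required node is not a candidate */
        if (required[i])
            picked++;             /* required nodes are always kept */
    }
    if (avail < min_nodes || picked > max_nodes)
        return SLURM_ERROR;       /* bitmap is left untouched on failure */

    /* Keep the required nodes, add candidates in index order until
     * min_nodes is reached, and clear every bit that is not needed. */
    for (i = 0; i < node_cnt; i++) {
        if (!bitmap[i] || required[i])
            continue;
        if (picked < min_nodes)
            picked++;
        else
            bitmap[i] = false;
    }
    return SLURM_SUCCESS;
}
```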
| |
| <p class="commandline">int select_p_job_begin (struct job_record *job_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Note that initiation of the specified job |
| is about to begin. This function is called immediately after |
| <span class="commandline">select_p_job_test()</span> successfully completes for this job.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> job_ptr</span> (input) pointer |
| to the job being initialized. Data in this job record may safely be read or written. |
| The <i>nodes</i> and <i>node_bitmap</i> fields of this job record identify the |
| nodes which have already been selected for this job to use. For an example of |
| a job record field that the plugin may write into, see <i>select_id</i>.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR, which causes the job to be requeued for |
| later execution.</p> |
| |
| <p class="commandline">int select_p_job_ready (struct job_record *job_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Test if resources are configured |
| and ready for job execution. This function is only used in the job prolog for |
| BlueGene systems to determine if the bglblock has been booted and is ready for use.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> job_ptr</span> (input) pointer |
| to the job being initialized. Data in this job record may safely be read. |
| The <i>nodes</i> and <i>node_bitmap</i> fields of this job record identify the |
| nodes which have already been selected for this job to use. </p> |
| <p style="margin-left:.2in"><b>Returns</b>: 1 if the job may begin execution, |
| 0 otherwise.</p> |
| |
| <p class="commandline">int select_p_job_fini (struct job_record *job_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Note the termination of the |
| specified job. This function is called as the termination process for the |
| job begins (prior to killing the tasks).</p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> job_ptr</span> (input) pointer |
| to the job being terminated. Data in this job record may safely be read or written. |
| The <i>nodes</i> and/or <i>node_bitmap</i> fields of this job record identify the |
| nodes which were selected for this job to use.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR.</p> |
| |
| <p class="commandline">int select_p_job_suspend (struct job_record *job_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Suspend the specified job. |
| Release resources for use by other jobs.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> job_ptr</span> (input) pointer |
| to the job being suspended. Data in this job record may safely be read or |
| written. The <i>nodes</i> and/or <i>node_bitmap</i> fields of this job record |
| identify the nodes which were selected for this job to use.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On |
| failure, the plugin should return a SLURM error code.</p> |
| |
| <p class="commandline">int select_p_job_resume (struct job_record *job_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Resume the specified job |
| which was previously suspended.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> job_ptr</span> (input) pointer |
| to the job being resumed. Data in this job record may safely be read or |
| written. The <i>nodes</i> and/or <i>node_bitmap</i> fields of this job record |
| identify the nodes which were selected for this job to use.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On |
| failure, the plugin should return a SLURM error code.</p> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h3>Get Information Functions</h3> |
| |
| <p class="commandline">int select_p_get_info_from_plugin(enum select_data_info info, |
| struct job_record *job_ptr, void *data);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Get plugin-specific information |
| about a job.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <span class="commandline"> info</span> (input) identifies |
| the type of data requested.<br><br> |
| <span class="commandline"> job_ptr</span> (input) pointer to |
| the job related to the query (if applicable; may be NULL).<br><br> |
| <span class="commandline"> data</span> (output) the requested data.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR.</p> |
| |
| <p class="commandline">int select_p_pack_node_info (time_t last_query_time, Buf *buffer_ptr);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Pack node specific information into a buffer.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>: |
| <span class="commandline"> |
| last_query_time</span> (input) time that the data was |
| last saved.<br> |
| <span class="commandline"> buffer_ptr</span> (input/output) buffer into |
| which the node data is appended.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful, |
| SLURM_NO_CHANGE_IN_DATA if data has not changed since last packed, otherwise SLURM_ERROR</p> |
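<p>The three return values described above suggest a change-detection pattern
along the following lines. Everything apart from the return code names is an
assumption made for this sketch: the <i>example_buf</i> structure stands in for
the SLURM buffer type, the function is renamed <i>example_pack_node_info</i>
because its signature differs, and the numeric value given to
SLURM_NO_CHANGE_IN_DATA is a placeholder for the definition in the SLURM
headers.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  -1
/* Placeholder value for this sketch; use the SLURM header definition. */
#define SLURM_NO_CHANGE_IN_DATA 1900

/* Hypothetical stand-in for the SLURM buffer type. */
struct example_buf {
    unsigned char data[256];
    size_t used;
};

/* Assumed plugin-private state. */
time_t last_node_update;        /* when the node data last changed */
uint32_t example_node_data;     /* the data to be packed */

int example_pack_node_info(time_t last_query_time, struct example_buf *buffer)
{
    uint32_t value = example_node_data;

    if (buffer == NULL)
        return SLURM_ERROR;
    /* Nothing has changed since the caller last asked: pack nothing. */
    if (last_query_time >= last_node_update)
        return SLURM_NO_CHANGE_IN_DATA;
    if (buffer->used + sizeof(value) > sizeof(buffer->data))
        return SLURM_ERROR;     /* not enough room left in the buffer */

    /* Append the data after whatever the buffer already holds. */
    memcpy(buffer->data + buffer->used, &value, sizeof(value));
    buffer->used += sizeof(value);
    return SLURM_SUCCESS;
}
```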
| |
| <p class="commandline">int select_p_get_select_nodeinfo(struct node_record *node_ptr, |
| enum select_data_info info, void *data);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Get plugin-specific information |
| related to the specified node.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <span class="commandline"> node_ptr</span> (input) pointer |
| to the node for which information is requested.<br><br> |
| <span class="commandline"> info</span> (input) identifies |
| the type of data requested.<br><br> |
| <span class="commandline"> data</span> (output) the requested data.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR.</p> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>Versioning</h2> |
| <p> This document describes version 1 of the SLURM node selection API. Future |
| releases of SLURM may revise this API. A node selection plugin conveys its ability |
| to implement a particular API version using the mechanism outlined for SLURM plugins. |
| In addition, the credential is transmitted along with the version number of the |
| plugin that transmitted it. It is at the discretion of the plugin author whether |
| to maintain data format compatibility across different versions of the plugin.</p> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p style="text-align:center;">Last modified 5 October 2009</p> |
| |
| <!--#include virtual="footer.txt"--> |