| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">SLURM Job Checkpoint Plugin API</a></h1> |
| |
| <h2> Overview</h2> |
| <p> This document describes SLURM job checkpoint plugins and the API that defines |
| them. It is intended as a resource to programmers wishing to write their own SLURM |
| job checkpoint plugins. This is version 0 of the API.</p> |
| |
| <p>SLURM job checkpoint plugins are SLURM plugins that implement the SLURM |
| API for checkpointing and restarting jobs. |
| The plugins must conform to the SLURM Plugin API with the following specifications:</p> |
| |
| <p><span class="commandline">const char plugin_type[]</span><br> |
| The major type must be "checkpoint." The minor type can be any recognizable |
| abbreviation for the type of scheduler. We recommend, for example:</p> |
| <ul> |
| <li><b>aix</b>—AIX system checkpoint.</li> |
| <li><b>none</b>—No job checkpoint.</li> |
| <li><b>ompi</b>—OpenMPI checkpoint (requires OpenMPI version 1.3 or higher).</li> |
| </ul></p> |
| |
| <p>The <span class="commandline">plugin_name</span> and |
| <span class="commandline">plugin_version</span> |
| symbols required by the SLURM Plugin API require no specialization for |
| job checkpoint support. |
| Note carefully, however, the versioning discussion below.</p> |
| |
| <p>The programmer is urged to study |
| <span class="commandline">src/plugins/checkpoint/checkpoint_aix.c</span> |
| for a sample implementation of a SLURM job checkpoint plugin.</p> |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>Data Objects</h2> |
| <p>The implementation must maintain (though not necessarily directly export) an |
| enumerated <span class="commandline">errno</span> to allow SLURM to discover |
| as practically as possible the reason for any failed API call. Plugin-specific enumerated |
| integer values may be used when appropriate. |
| |
| <p>These values must not be used as return values in integer-valued functions |
| in the API. The proper error return value from integer-valued functions is SLURM_ERROR. |
| The implementation should endeavor to provide useful and pertinent information by |
| whatever means is practical. |
| Successful API calls are not required to reset any errno to a known value. However, |
| the initial value of any errno, prior to any error condition arising, should be |
| SLURM_SUCCESS. </p> |
| |
| <p>There is also a checkpoint-specific error code and message that may be associated |
| with each job step.</p> |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>API Functions</h2> |
| <p>The following functions must appear. Functions which are not implemented should |
| be stubbed.</p> |
| |
| <p class="commandline">int slurm_ckpt_alloc_job (check_jobinfo_t *jobinfo);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Allocate storage for job-step specific |
| checkpoint data.</p> |
| <p style="margin-left:.2in"><b>Argument</b>: |
| <b>jobinfo</b> (output) returns pointer to the allocated storage.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR and set the errno to an appropriate value |
| to indicate the reason for failure.</p> |
| |
| <p class="commandline">int slurm_ckpt_free_job (check_jobinfo_t jobinfo);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Release storage for job-step specific |
| checkpoint data that was previously allocated by slurm_ckpt_alloc_job.</p> |
| <p style="margin-left:.2in"><b>Argument</b>: |
| <b>jobinfo</b> (input) pointer to the previously allocated storage.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR and set the errno to an appropriate value |
| to indicate the reason for failure.</p> |
| |
| <p class="commandline">int slurm_ckpt_pack_job (check_jobinfo_t jobinfo, Buf buffer);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Store job-step specific checkpoint data |
| into a buffer.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <b>jobinfo</b> (input) pointer to the previously allocated storage.<br> |
| <b>Buf</b> (input/output) buffer to which jobinfo has been appended.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR and set the errno to an appropriate value |
| to indicate the reason for failure.</p> |
| |
| <p class="commandline">int slurm_ckpt_unpack_job (check_jobinfo_t jobinfo, Buf buffer);</p> |
| <p style="margin-left:.2in"><b>Description</b>: Retrieve job-step specific checkpoint data |
| from a buffer.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:</br> |
| <b>jobinfo</b> (output) pointer to the previously allocated storage.<br> |
| <b>Buf</b> (input/output) buffer from which jobinfo has been removed.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR and set the errno to an appropriate value |
| to indicate the reason for failure.</p> |
| |
| <p class="commandline">int slurm_ckpt_op ( uint16_t op, uint16_t data, |
| struct step_record * step_ptr, time_t * event_time, |
| uint32_t *error_code, char **error_msg );</p> |
| <p style="margin-left:.2in"><b>Description</b>: Perform some checkpoint operation on a |
| specific job step.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <b>op</b> (input) specifies the operation to be performed. |
| Currently supported operations include |
| CHECK_ABLE (is job step currently able to be checkpointed), |
| CHECK_DISABLE (disable checkpoints for this job step), |
| CHECK_ENABLE (enable checkpoints for this job step), |
| CHECK_CREATE (create a checkpoint for this job step and continue its execution), |
| CHECK_VACATE (create a checkpoint for this job step and terminate it), |
| CHECK_RESTART (restart this previously checkpointed job step), and |
| CHECK_ERROR (return checkpoint-specific error information for this job step).<br> |
| <b>data</b> (input) operation-specific data.</br> |
| <b>step_ptr</b> (input/output) identifies the job step to be operated upon.</br> |
| <b>event_time</b> (output) identifies the time of a checkpoint or restart |
| operation.</br> |
| <b>error_code</b> (output) returns checkpoint-specific error code |
| associated with an operation.</br> |
| <b>error_msg</b> (output) identifies checkpoint-specific error message |
| associated with an operation.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: <br> |
| SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR and set the error_code and error_msg to an |
| appropriate value to indicate the reason for failure.</p> |
| |
| <p class="commandline">int slurm_ckpt_comp ( struct step_record * step_ptr, time_t event_time, |
| uint32_t error_code, char *error_msg );</p> |
| <p style="margin-left:.2in"><b>Description</b>: Note the completion of a checkpoint operation.</p> |
| <p style="margin-left:.2in"><b>Arguments</b>:<br> |
| <b>step_ptr</b> (input/output) identifies the job step to be operated upon.</br> |
| <b>event_time</b> (input) identifies the time that the checkpoint operation |
| began.</br> |
| <b>error_code</b> (input) checkpoint-specific error code associated |
| with an operation.</br> |
| <b>error_msg</b> (input) checkpoint-specific error message associated |
| with an operation.</p> |
| <p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure, |
| the plugin should return SLURM_ERROR and set the error_code and error_msg to an |
| appropriate value to indicate the reason for failure.</p> |
| |
| |
| <h2>Versioning</h2> |
| <p> This document describes version 0 of the SLURM checkpoint API. Future |
| releases of SLURM may revise this API. A scheduler plugin conveys its ability |
| to implement a particular API version using the mechanism outlined for SLURM plugins.</p> |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p style="text-align:center;">Last modified 21 August 2007</p> |
| |
| <!--#include virtual="footer.txt"--> |