<!--#include virtual="header.txt"-->
<h1><a name="top">SLURM Job Checkpoint Plugin API</a></h1>
<h2> Overview</h2>
<p> This document describes SLURM job checkpoint plugins and the API that defines
them. It is intended as a resource to programmers wishing to write their own SLURM
job checkpoint plugins. This is version 0 of the API.</p>
<p>SLURM job checkpoint plugins are SLURM plugins that implement the SLURM
API for checkpointing and restarting jobs.
The plugins must conform to the SLURM Plugin API with the following specifications:</p>
<p><span class="commandline">const char plugin_type[]</span><br>
The major type must be &quot;checkpoint&quot;. The minor type can be any recognizable
abbreviation for the type of checkpoint mechanism. We recommend, for example:</p>
<ul>
<li><b>aix</b>&#151;AIX system checkpoint.</li>
<li><b>none</b>&#151;No job checkpoint.</li>
<li><b>ompi</b>&#151;OpenMPI checkpoint (requires OpenMPI version 1.3 or higher).</li>
</ul>
<p>The <span class="commandline">plugin_name</span> and
<span class="commandline">plugin_version</span>
symbols required by the SLURM Plugin API require no specialization for
job checkpoint support.
Note carefully, however, the versioning discussion below.</p>
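<p>The identification symbols described above might be declared as in the
following minimal sketch. The minor type <span class="commandline">example</span>,
the name string, the version value, and the helper
<span class="commandline">has_checkpoint_major_type</span> are hypothetical and
chosen only for illustration. The &quot;major/minor&quot; form of the type string and
the symbol types are assumed to follow the general SLURM Plugin API.</p>

```c
#include <stdint.h>
#include <string.h>

/* Identification symbols for a hypothetical checkpoint plugin.
 * All three values are illustrative assumptions, not taken from SLURM. */
const char plugin_name[] = "Example checkpoint plugin";
const char plugin_type[] = "checkpoint/example";   /* major type / minor type */
const uint32_t plugin_version = 100;               /* placeholder value */

/* Hypothetical helper: returns 1 when the type string starts with the
 * major type "checkpoint" followed by '/', and 0 otherwise. */
static int has_checkpoint_major_type(const char *type)
{
    static const char prefix[] = "checkpoint/";

    return strncmp(type, prefix, sizeof(prefix) - 1) == 0;
}
```

<p>Calling <span class="commandline">has_checkpoint_major_type(plugin_type)</span>
is one way a test could confirm that the declared type carries the required
major type.</p>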
<p>The programmer is urged to study
<span class="commandline">src/plugins/checkpoint/checkpoint_aix.c</span>
for a sample implementation of a SLURM job checkpoint plugin.</p>
<p class="footer"><a href="#top">top</a></p>
<h2>Data Objects</h2>
<p>The implementation must maintain (though not necessarily directly export) an
enumerated <span class="commandline">errno</span> so that SLURM can discover,
as far as is practical, the reason for any failed API call. Plugin-specific enumerated
integer values may be used when appropriate.</p>
<p>These values must not be used as return values in integer-valued functions
in the API. The proper error return value from integer-valued functions is SLURM_ERROR.
The implementation should endeavor to provide useful and pertinent information by
whatever means is practical.
Successful API calls are not required to reset any errno to a known value. However,
the initial value of any errno, prior to any error condition arising, should be
SLURM_SUCCESS. </p>
<p>There is also a checkpoint-specific error code and message that may be associated
with each job step.</p>
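<p>The error-handling rules above can be sketched as follows. The enumeration
values, the variable <span class="commandline">plugin_errno</span>, and the
function <span class="commandline">example_call</span> are hypothetical. The
numeric values of SLURM_SUCCESS and SLURM_ERROR are defined locally as stand-ins
so that the sketch compiles on its own; a real plugin would take them from the
SLURM headers.</p>

```c
/* Stand-ins for the SLURM return codes (assumed values for this sketch). */
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

/* Hypothetical plugin-specific error values. */
enum example_ckpt_errno {
    EX_CKPT_NO_SPACE = 1000,
    EX_CKPT_NOT_ABLE
};

/* Plugin-private errno; its initial value is SLURM_SUCCESS. */
static int plugin_errno = SLURM_SUCCESS;

/* Hypothetical API-style function that fails for negative input. */
static int example_call(int value)
{
    if (value < 0) {
        plugin_errno = EX_CKPT_NOT_ABLE; /* record the reason for failure */
        return SLURM_ERROR;              /* never return the errno value itself */
    }
    /* A successful call is not required to reset plugin_errno. */
    return SLURM_SUCCESS;
}
```

<p>In this sketch a later successful call leaves
<span class="commandline">plugin_errno</span> at the value set by the last
failure, which the rules above permit.</p>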
<p class="footer"><a href="#top">top</a></p>
<h2>API Functions</h2>
<p>The following functions must appear. Functions which are not implemented should
be stubbed.</p>
<p class="commandline">int slurm_ckpt_alloc_job (check_jobinfo_t *jobinfo);</p>
<p style="margin-left:.2in"><b>Description</b>: Allocate storage for job-step specific
checkpoint data.</p>
<p style="margin-left:.2in"><b>Argument</b>:
<b>jobinfo</b> (output) returns pointer to the allocated storage.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.</p>
<p class="commandline">int slurm_ckpt_free_job (check_jobinfo_t jobinfo);</p>
<p style="margin-left:.2in"><b>Description</b>: Release storage for job-step specific
checkpoint data that was previously allocated by slurm_ckpt_alloc_job.</p>
<p style="margin-left:.2in"><b>Argument</b>:
<b>jobinfo</b> (input) pointer to the previously allocated storage.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.</p>
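<p>A minimal sketch of the allocate and free pair follows. The structure
<span class="commandline">example_check_job_info</span> and its fields are
hypothetical, and the <span class="commandline">check_jobinfo_t</span> typedef
is a local stand-in for the opaque type that SLURM provides. The sketch uses the
C library <span class="commandline">errno</span> and
<span class="commandline">calloc</span> for simplicity; a real plugin may use
different facilities.</p>

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for the SLURM return codes (assumed values for this sketch). */
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

/* Hypothetical job-step specific checkpoint data. */
struct example_check_job_info {
    uint16_t disabled;    /* non-zero when checkpoints are disabled */
    uint32_t error_code;  /* last checkpoint-specific error code */
    char    *error_msg;   /* last checkpoint-specific error message */
};

/* Local stand-in for the opaque type declared by SLURM. */
typedef struct example_check_job_info *check_jobinfo_t;

int slurm_ckpt_alloc_job(check_jobinfo_t *jobinfo)
{
    if (jobinfo == NULL) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    *jobinfo = calloc(1, sizeof(**jobinfo)); /* zero-filled storage */
    if (*jobinfo == NULL) {
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}

int slurm_ckpt_free_job(check_jobinfo_t jobinfo)
{
    if (jobinfo == NULL)          /* design choice: freeing nothing succeeds */
        return SLURM_SUCCESS;
    free(jobinfo->error_msg);     /* free(NULL) is a no-op */
    free(jobinfo);
    return SLURM_SUCCESS;
}
```

<p>Treating a NULL argument to <span class="commandline">slurm_ckpt_free_job</span>
as success is a design choice of this sketch, not a requirement of the API.</p>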
<p class="commandline">int slurm_ckpt_pack_job (check_jobinfo_t jobinfo, Buf buffer);</p>
<p style="margin-left:.2in"><b>Description</b>: Store job-step specific checkpoint data
into a buffer.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<b>jobinfo</b> (input) pointer to the previously allocated storage.<br>
<b>buffer</b> (input/output) buffer to which jobinfo is appended.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.</p>
<p class="commandline">int slurm_ckpt_unpack_job (check_jobinfo_t jobinfo, Buf buffer);</p>
<p style="margin-left:.2in"><b>Description</b>: Retrieve job-step specific checkpoint data
from a buffer.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<b>jobinfo</b> (output) pointer to the previously allocated storage that receives the unpacked data.<br>
<b>buffer</b> (input/output) buffer from which jobinfo is removed.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.</p>
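<p>The pack and unpack pair can be sketched as a simple round trip. SLURM's
<span class="commandline">Buf</span> type and its packing routines are not
reproduced here: <span class="commandline">struct example_buf</span> is a
hypothetical fixed-size stand-in, the job data structure is hypothetical, and
values are copied in host byte order for brevity. A real plugin would use the
buffer routines supplied by SLURM.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the SLURM return codes (assumed values for this sketch). */
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

/* Hypothetical stand-in for SLURM's Buf type. */
struct example_buf {
    unsigned char data[64];
    size_t used;      /* bytes written so far */
    size_t read_pos;  /* offset of the next byte to read */
};
typedef struct example_buf *Buf;

/* Hypothetical job-step specific checkpoint data. */
struct example_check_job_info {
    uint16_t disabled;
    uint32_t error_code;
};
typedef struct example_check_job_info *check_jobinfo_t;

#define PACKED_SIZE (sizeof(uint16_t) + sizeof(uint32_t))

int slurm_ckpt_pack_job(check_jobinfo_t jobinfo, Buf buffer)
{
    if (jobinfo == NULL || buffer == NULL)
        return SLURM_ERROR;
    if (sizeof(buffer->data) - buffer->used < PACKED_SIZE)
        return SLURM_ERROR;                       /* not enough room left */
    memcpy(buffer->data + buffer->used, &jobinfo->disabled, sizeof(uint16_t));
    buffer->used += sizeof(uint16_t);
    memcpy(buffer->data + buffer->used, &jobinfo->error_code, sizeof(uint32_t));
    buffer->used += sizeof(uint32_t);
    return SLURM_SUCCESS;
}

int slurm_ckpt_unpack_job(check_jobinfo_t jobinfo, Buf buffer)
{
    if (jobinfo == NULL || buffer == NULL)
        return SLURM_ERROR;
    if (buffer->used - buffer->read_pos < PACKED_SIZE)
        return SLURM_ERROR;                       /* not enough data to read */
    memcpy(&jobinfo->disabled, buffer->data + buffer->read_pos, sizeof(uint16_t));
    buffer->read_pos += sizeof(uint16_t);
    memcpy(&jobinfo->error_code, buffer->data + buffer->read_pos, sizeof(uint32_t));
    buffer->read_pos += sizeof(uint32_t);
    return SLURM_SUCCESS;
}
```

<p>The fields must be unpacked in the same order in which they were packed;
that ordering is the only contract between the two functions in this sketch.</p>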
<p class="commandline">int slurm_ckpt_op ( uint16_t op, uint16_t data,
struct step_record * step_ptr, time_t * event_time,
uint32_t *error_code, char **error_msg );</p>
<p style="margin-left:.2in"><b>Description</b>: Perform some checkpoint operation on a
specific job step.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<b>op</b> (input) specifies the operation to be performed.
Currently supported operations include
CHECK_ABLE (is job step currently able to be checkpointed),
CHECK_DISABLE (disable checkpoints for this job step),
CHECK_ENABLE (enable checkpoints for this job step),
CHECK_CREATE (create a checkpoint for this job step and continue its execution),
CHECK_VACATE (create a checkpoint for this job step and terminate it),
CHECK_RESTART (restart this previously checkpointed job step), and
CHECK_ERROR (return checkpoint-specific error information for this job step).<br>
<b>data</b> (input) operation-specific data.<br>
<b>step_ptr</b> (input/output) identifies the job step to be operated upon.<br>
<b>event_time</b> (output) identifies the time of a checkpoint or restart
operation.<br>
<b>error_code</b> (output) returns checkpoint-specific error code
associated with an operation.<br>
<b>error_msg</b> (output) returns checkpoint-specific error message
associated with an operation.</p>
<p style="margin-left:.2in"><b>Returns</b>: <br>
SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set error_code and error_msg to
appropriate values to indicate the reason for failure.</p>
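<p>The dispatch on <span class="commandline">op</span> can be sketched as a
switch statement. Everything SLURM-specific is replaced by a local stand-in:
the operation codes are declared in a local enumeration whose numeric values are
assumptions, <span class="commandline">struct step_record</span> is reduced to
two hypothetical fields, and the error codes and the helper
<span class="commandline">copy_msg</span> are hypothetical. The sketch handles
only CHECK_ABLE, CHECK_DISABLE, and CHECK_ENABLE and reports every other
operation as unsupported.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Stand-ins for the SLURM return codes (assumed values for this sketch). */
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

/* Local stand-in for the operation codes; the real ones come from SLURM. */
enum example_check_opts {
    CHECK_ABLE, CHECK_DISABLE, CHECK_ENABLE, CHECK_CREATE,
    CHECK_VACATE, CHECK_RESTART, CHECK_ERROR
};

/* Hypothetical checkpoint-specific error codes. */
enum { EX_CKPT_DISABLED = 1, EX_CKPT_NOT_SUPPORTED = 2 };

/* Hypothetical, heavily reduced stand-in for SLURM's step record. */
struct step_record {
    int    ckpt_disabled;  /* non-zero when checkpoints are disabled */
    time_t last_ckpt;      /* time of the most recent checkpoint, 0 if none */
};

/* Hypothetical helper: heap copy of a message; the caller frees it. */
static char *copy_msg(const char *msg)
{
    size_t len = strlen(msg) + 1;
    char *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, msg, len);
    return copy;
}

int slurm_ckpt_op(uint16_t op, uint16_t data, struct step_record *step_ptr,
                  time_t *event_time, uint32_t *error_code, char **error_msg)
{
    (void) data;  /* operation-specific data is unused in this sketch */
    if (!step_ptr || !event_time || !error_code || !error_msg)
        return SLURM_ERROR;

    switch (op) {
    case CHECK_ABLE:
        if (step_ptr->ckpt_disabled) {
            *error_code = EX_CKPT_DISABLED;
            *error_msg = copy_msg("checkpoints are disabled for this step");
            return SLURM_ERROR;
        }
        *event_time = step_ptr->last_ckpt;
        return SLURM_SUCCESS;
    case CHECK_DISABLE:
        step_ptr->ckpt_disabled = 1;
        return SLURM_SUCCESS;
    case CHECK_ENABLE:
        step_ptr->ckpt_disabled = 0;
        return SLURM_SUCCESS;
    default:  /* CHECK_CREATE, CHECK_VACATE, CHECK_RESTART, CHECK_ERROR, other */
        *error_code = EX_CKPT_NOT_SUPPORTED;
        *error_msg = copy_msg("operation not supported by this plugin");
        return SLURM_ERROR;
    }
}
```

<p>In this sketch the caller owns any message returned through
<span class="commandline">error_msg</span> and releases it with
<span class="commandline">free</span>; that ownership rule is an assumption of
the example.</p>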
<p class="commandline">int slurm_ckpt_comp ( struct step_record * step_ptr, time_t event_time,
uint32_t error_code, char *error_msg );</p>
<p style="margin-left:.2in"><b>Description</b>: Note the completion of a checkpoint operation.</p>
<p style="margin-left:.2in"><b>Arguments</b>:<br>
<b>step_ptr</b> (input/output) identifies the job step to be operated upon.<br>
<b>event_time</b> (input) identifies the time that the checkpoint operation
began.<br>
<b>error_code</b> (input) checkpoint-specific error code associated
with an operation.<br>
<b>error_msg</b> (input) checkpoint-specific error message associated
with an operation.</p>
<p style="margin-left:.2in"><b>Returns</b>: SLURM_SUCCESS if successful. On failure,
the plugin should return SLURM_ERROR and set the errno to an appropriate value
to indicate the reason for failure.</p>
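<p>Noting the completion of a checkpoint operation might look like the
following sketch. <span class="commandline">struct step_record</span> is a
hypothetical stand-in with three illustrative fields, and the rule that
<span class="commandline">event_time</span> must match the recorded start time
of the operation in progress is an assumption of the example, not a documented
requirement.</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Stand-ins for the SLURM return codes (assumed values for this sketch). */
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)

/* Hypothetical, heavily reduced stand-in for SLURM's step record. */
struct step_record {
    time_t   ckpt_start;       /* start time of the operation in progress, 0 if none */
    uint32_t ckpt_error_code;  /* result of the last completed operation */
    char    *ckpt_error_msg;   /* heap copy of the last error message, or NULL */
};

int slurm_ckpt_comp(struct step_record *step_ptr, time_t event_time,
                    uint32_t error_code, char *error_msg)
{
    char *copy = NULL;

    if (step_ptr == NULL)
        return SLURM_ERROR;
    /* Reject a completion that does not match the operation in progress. */
    if (step_ptr->ckpt_start == 0 || event_time != step_ptr->ckpt_start)
        return SLURM_ERROR;

    if (error_msg != NULL) {
        size_t len = strlen(error_msg) + 1;

        copy = malloc(len);
        if (copy == NULL)
            return SLURM_ERROR;
        memcpy(copy, error_msg, len);  /* keep a private copy of the message */
    }
    free(step_ptr->ckpt_error_msg);    /* drop the previous message, if any */
    step_ptr->ckpt_error_msg = copy;
    step_ptr->ckpt_error_code = error_code;
    step_ptr->ckpt_start = 0;          /* no operation is in progress any more */
    return SLURM_SUCCESS;
}
```

<p>Copying the message rather than storing the caller's pointer keeps the
record valid after the caller releases its own buffer.</p>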
<h2>Versioning</h2>
<p>This document describes version 0 of the SLURM checkpoint API. Future
releases of SLURM may revise this API. A checkpoint plugin conveys its ability
to implement a particular API version using the mechanism outlined for SLURM plugins.</p>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 21 August 2007</p>
<!--#include virtual="footer.txt"-->