blob: 31281bbcbaeadafd88d8db19cc700384a6bfc9ef [file] [log] [blame] [edit]
<!--#include virtual="header.txt"-->
<h1><a name="top">SLURM Checkpoint/Restart with BLCR</a></h1>
<h2>Overview</h2>
<p>SLURM version 2.0 has been integrated with
<a href="https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml">
Berkeley Lab Checkpoint/Restart (BLCR)</a> in order to provide automatic
job checkpoint/restart support.
Functionality provided includes:
<ol>
<li>Checkpoint of whole batch jobs in addition to job steps</li>
<li>Periodic checkpoint of batch jobs and job steps</li>
<li>Restart execution of batch jobs and job steps from checkpoint files</li>
<li>Automatically requeue and restart the execution of batch jobs upon
node failure</li>
</ol></p>
<p>The general mode of operation is to
<ol>
<li>Start the job step using the <b>srun_cr</b> command as described
below.</li>
<li>Create a checkpoint of <b>srun_cr</b> using BLCR's <b>cr_checkpoint</b>
command and cancel the job. <b>srun_cr</b> will automatically checkpoint
your job.</li>
<li>Restart <b>srun_cr</b> using BLCR's <b>cr_restart</b> command.
The job will be restarted using a newly allocated jobid.</li>
</ol>
<p><b>NOTE:</b> checkpoint/blcr cannot restart interactive jobs. It can
create checkpoints for both interactive and batch steps, but only
batch jobs can be restarted.</p>
<p><b>NOTE:</b> BLCR operation has been verified with MVAPICH2.
Some other MPI implementations should also work.</p>
<h2>User Commands</h2>
<p>The following documents SLURM changes specific to BLCR support.
Basic familiarity with SLURM commands is assumed.</p>
<h3>srun</h3>
<p>Several options have been added to support checkpoint restart:</p>
<ul>
<li><b>--checkpoint</b>: Specifies the interval between creating
checkpoints of the job step.
By default, the job step will have no checkpoints created.
Acceptable time formats include "minutes", "minutes:seconds",
"hours:minutes:seconds", "days-hours", "days-hours:minutes" and
"days-hours:minutes:seconds".</li>
<li><b>--checkpoint-dir</b>:Specify the directory where the checkpoint image
files of a job step will be stored.
The default value is the current working directory.
Checkpoint files will be of the form <i>"&lt;job_id&gt;.ckpt"</i> for jobs
and <i>"&lt;job_id&gt;.&lt;step_id&gt;.ckpt"</i> for job steps.</li>
<li><b>--restart-dir</b>: Specify the directory when the checkpoint image
files of a job step will be read from</li>
</li>
</ul>
<p>Environment variables are available for all of these options:</p>
<ul>
<li<b>SLURM_CHECKPOINT</b> is equivalent to <b>--checkpoint</b>:</li>
<li><b>SLURM_CHECKPOINT_DIR</b> is equivalent to <b>--checkpoint-dir</b></li>
<li><b>SLURM_RESTART_DIR</b> is equivalent to <b>--restart-dir</b></li>
</li>
</ul>
<p>The environment variable <b>SLURM_SRUN_CR_SOCKET</b> is used for job step
logic to interact with the <b>srun_cr</b> command.</p>
<h3>srun_cr</h3>
<p>This is a wrapper program for use with SLURM's <b>checkpoint/blcr</b>
plugin to checkpoint/restart tasks launched by srun.
The design of <b>srun_cr</b> is inspired by <b>mpiexec_cr</b> from MVAPICH2 and
<b>cr_restart</b> form BLCR.
It is a wrapper around the <b>srun</b> command to enable batch job
checkpoint/restart support when used with SLURM's <b>checkpoint/blcr</b> plugin.
<p>The <b>srun_cr</b> execute line options are identical to those of the
<b>srun</b> command.
See "man srun" for details.</p>
<p>After initialization, <b>srun_cr</b> registers a thread context callback
function.
Then it forks a process and executes "cr_run --omit srun" with its arguments.
<b>cr_run</b> is employed to exclude the <b>srun</b> process from being dumped
upon checkpoint.
All catchable signals except SIGCHLD sent to <b>srun_cr</b> will be forwarded
to the child <b>srun</b> process.
SIGCHLD will be captured to mimic the exit status of <b>srun</b> when it exits.
Then <b>srun_cr</b> loops waiting for termination of tasks being launched
from <b>srun</b>.</p>
<p>The step launch logic of SLURM is augmented to check if <b>srun</b> is
running under <b>srun_cr</b>.
If true, the environment variable <b>SURN_SRUN_CR_SOCKET</b> should be present,
the value of which is the address of a Unix domain socket created and listened
to be <b>srun_cr</b>.
After launching the tasks, <b>srun</b> tires to connect to the socket and sends
the job ID, step ID and the nodes allocated to the step to <b>srun_cr</b>.</p>
<p>Upon checkpoint, </b>srun_cr</b> checks to see if the tasks have been launched.
If not </b>srun_cr</b> first forwards the checkpoint request to the tasks by
calling the SLURM API <b>slurm_checkpoint_tasks()</b> before dumping its process
context.</p>
<p>Upon restart, <b>srun_cr</b> checks to see if the tasks have been previously
launched and checkpointed.
If true, the environment variable </b>SLURM_RESTART_DIR</b> is set to the
directory of the checkpoint image files of the tasks.
Then <b>srun</b> is forked and executed again.
The environment variable will be used by the <b>srun</b> command to restart
execution of the tasks from the previous checkpoint.</p>
<h3>sbatch</h3>
<p>Several options have been added to support checkpoint restart:</p>
<ul>
<li><b>--checkpoint</b>: Specifies the interval between periodic checkpoint
of a batch job.
By default, the job will have no checkpoints created.
Acceptable time formats include "minutes", "minutes:seconds",
"hours:minutes:seconds", "days-hours", "days-hours:minutes" and
"days-hours:minutes:seconds".</li>
<li><b>--checkpoint-dir</b>:Specify the directory when the checkpoint image
files of a batch job will be stored.
The default value is the current working directory.
Checkpoint files will be of the form <i>"&lt;job_id&gt;.ckpt"</i> for jobs
and <i>"&lt;job_id&gt;.&lt;step_id&gt;.ckpt"</i> for job steps.</li>
</li>
</ul>
<p>Environment variables are available for all of these options:</p>
<ul>
<li<b>SLURM_CHECKPOINT</b> is equivalent to <b>--checkpoint</b>:</li>
<li><b>SLURM_CHECKPOINT_DIR</b> is equivalent to <b>--checkpoint-dir</b></li>
</li>
</ul>
<h3>scontrol</h3>
<p><b>scontrol</b> is used to initiate checkpoint/restart requests.</p>
<ul>
<li><b>scontrol checkpoint create <i>jobid</i> [ImageDir=<i>dir</i>]
[MaxWait=<i>seconds</i>]</b><br>
Requests a checkpoint on a specific job.
For backward compatibility, if a job id is specified, all job steps of
it are checkpointed.
If a batch job id is specified, the entire job is checkpointed including
the batch shell and all running tasks of all job steps.
Upon checkpoint, the task launch command must forward the requests to
tasks it launched.
<ul>
<li><b>ImageDir</b> specifies the directory in which to save the checkpoint
image files. If specified, this takes precedence over any <b>--checkpoint-dir</b>
option specified when the job or job step were submitted.</li>
<li><b>MaxWait</b> specifies the maximum time permitted for a checkpoint
request to complete. The request will be considered failed if not
completed in this time period.</li>
</li>
</ul>
<li><b>scontrol checkpoint create <i>jobid.stepid</i> [ImageDir=<i>dir</i>]
[MaxWait=<i>seconds</i>]</b><br>
Requests a checkpoint on a specific job step.</li>
<li><b>scontrol checkpoint restart <i>jobid</i> [ImageDir=<i>dir</i>]
[StickToNodes]</b><br>
Restart a previously checkpointed batch job.
<ul>
<li><b>ImageDir</b> specifies the directory from which to read the checkpoint
image files.</li>
<li><b>StickToNodes</b> specifies that the job should be restarted on the
same set of nodes from which it was previously checkpointed.</li>
</ul></li>
</ul>
<h2>Configuration</h2>
<p>The following SLURM configuration parameter has been added:</p>
<ul>
<li><b>JobCheckpointDir</b>
Specifies the default directory for storing or reading job checkpoint
information. The data stored here is only a few thousand bytes per job
and includes information needed to resubmit the job request, not job's
memory image. The directory must be readable and writable by
<b>SlurmUser</b>, but not writable by regular users. The job memory images
may be in a different location as specified by <b>--checkpoint-dir</b>
option at job submit time or scontrol's <b>ImageDir</b> option.
</ul>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 26 January 2010</p>
<!--#include virtual="footer.txt"-->