| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">SLURM Checkpoint/Restart with BLCR</a></h1> |
| |
| <h2>Overview</h2> |
| <p>SLURM version 2.0 has been integrated with |
| <a href="https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml"> |
| Berkeley Lab Checkpoint/Restart (BLCR)</a> in order to provide automatic |
| job checkpoint/restart support. |
| Functionality provided includes: |
| <ol> |
| <li>Checkpoint of whole batch jobs in addition to job steps</li> |
| <li>Periodic checkpoint of batch jobs and job steps</li> |
| <li>Restart execution of batch jobs and job steps from checkpoint files</li> |
| <li>Automatically requeue and restart the execution of batch jobs upon |
| node failure</li> |
| </ol></p> |
| |
| <p>The general mode of operation is to |
| <ol> |
| <li>Start the job step using the <b>srun_cr</b> command as described |
| below.</li> |
| <li>Create a checkpoint of <b>srun_cr</b> using BLCR's <b>cr_checkpoint</b> |
| command and cancel the job. <b>srun_cr</b> will automatically checkpoint |
| your job.</li> |
| <li>Restart <b>srun_cr</b> using BLCR's <b>cr_restart</b> command. |
| The job will be restarted using a newly allocated jobid.</li> |
| </ol></p> |
| |
| <p><b>NOTE:</b> checkpoint/blcr cannot restart interactive jobs. It can |
| create checkpoints for both interactive and batch steps, but only |
| batch jobs can be restarted.</p> |
| |
| <p><b>NOTE:</b> BLCR operation has been verified with MVAPICH2. |
| Some other MPI implementations should also work.</p> |
| |
| <h2>User Commands</h2> |
| |
| <p>The following documents SLURM changes specific to BLCR support. |
| Basic familiarity with SLURM commands is assumed.</p> |
| |
| <h3>srun</h3> |
| |
| <p>Several options have been added to support checkpoint/restart:</p> |
| <ul> |
| <li><b>--checkpoint</b>: Specifies the interval between creating |
| checkpoints of the job step. |
| By default, the job step will have no checkpoints created. |
| Acceptable time formats include "minutes", "minutes:seconds", |
| "hours:minutes:seconds", "days-hours", "days-hours:minutes" and |
| "days-hours:minutes:seconds".</li> |
| <li><b>--checkpoint-dir</b>: Specifies the directory where the checkpoint image |
| files of a job step will be stored. |
| The default value is the current working directory. |
| Checkpoint files will be of the form <i>"&lt;job_id&gt;.ckpt"</i> for jobs |
| and <i>"&lt;job_id&gt;.&lt;step_id&gt;.ckpt"</i> for job steps.</li> |
| <li><b>--restart-dir</b>: Specifies the directory from which the checkpoint image |
| files of a job step will be read.</li> |
| </ul> |
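| <p>As an illustration of the time formats and file names listed above, the |
| following Python sketch converts a <b>--checkpoint</b> interval string into |
| seconds and builds the expected checkpoint file name. |
| The helper names <i>parse_interval</i> and <i>checkpoint_file_name</i> are |
| hypothetical and not part of SLURM, and the sketch assumes that each field |
| means what its name suggests.</p> |

```python
def parse_interval(text):
    """Convert a checkpoint interval string to seconds.

    Accepts "minutes", "minutes:seconds", "hours:minutes:seconds",
    "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
    """
    text = text.strip()
    if "-" in text:
        # A leading "days-" part means the remaining fields start at hours.
        day_part, rest = text.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError("too many fields: %r" % text)
        hours, minutes, seconds = (fields + [0, 0])[:3]
    else:
        # Without a day part, a single field is minutes, not hours.
        days = 0
        fields = [int(f) for f in text.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError("too many fields: %r" % text)
    if min(days, hours, minutes, seconds) < 0:
        raise ValueError("negative field: %r" % text)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def checkpoint_file_name(job_id, step_id=None):
    """Return "<job_id>.ckpt" for a job or "<job_id>.<step_id>.ckpt" for a step."""
    if step_id is None:
        return "%d.ckpt" % job_id
    return "%d.%d.ckpt" % (job_id, step_id)
```

| <p>With these assumptions, an interval of "1-12" is read as one day and |
| twelve hours, and "30" as thirty minutes.</p> |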
| |
| <p>Environment variables are available for all of these options:</p> |
| <ul> |
| <li><b>SLURM_CHECKPOINT</b> is equivalent to <b>--checkpoint</b></li> |
| <li><b>SLURM_CHECKPOINT_DIR</b> is equivalent to <b>--checkpoint-dir</b></li> |
| <li><b>SLURM_RESTART_DIR</b> is equivalent to <b>--restart-dir</b></li> |
| </ul> |
| <p>The environment variable <b>SLURM_SRUN_CR_SOCKET</b> is used for job step |
| logic to interact with the <b>srun_cr</b> command.</p> |
| |
| <h3>srun_cr</h3> |
| |
| <p><b>srun_cr</b> is a wrapper around the <b>srun</b> command that enables |
| checkpoint/restart of batch jobs and of the tasks launched by <b>srun</b> when |
| used with SLURM's <b>checkpoint/blcr</b> plugin. |
| Its design is inspired by <b>mpiexec_cr</b> from MVAPICH2 and |
| <b>cr_restart</b> from BLCR.</p> |
| |
| <p>The <b>srun_cr</b> command-line options are identical to those of the |
| <b>srun</b> command. |
| See "man srun" for details.</p> |
| |
| <p>After initialization, <b>srun_cr</b> registers a thread context callback |
| function. |
| Then it forks a process and executes "cr_run --omit srun" with its arguments. |
| <b>cr_run</b> is employed to exclude the <b>srun</b> process from being dumped |
| upon checkpoint. |
| All catchable signals except SIGCHLD sent to <b>srun_cr</b> will be forwarded |
| to the child <b>srun</b> process. |
| SIGCHLD will be captured to mimic the exit status of <b>srun</b> when it exits. |
| Then <b>srun_cr</b> loops waiting for termination of tasks being launched |
| from <b>srun</b>.</p> |
| |
| <p>The step launch logic of SLURM is augmented to check if <b>srun</b> is |
| running under <b>srun_cr</b>. |
| If so, the environment variable <b>SLURM_SRUN_CR_SOCKET</b> should be present; |
| its value is the address of a Unix domain socket created and listened on |
| by <b>srun_cr</b>. |
| After launching the tasks, <b>srun</b> tries to connect to the socket and sends |
| the job ID, step ID and the nodes allocated to the step to <b>srun_cr</b>.</p> |
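| <p>The exchange described above can be pictured with the following Python |
| sketch, in which one thread plays the listening <b>srun_cr</b> side and the |
| main thread plays <b>srun</b>. |
| Only the use of a Unix domain socket and of the |
| <b>SLURM_SRUN_CR_SOCKET</b> variable comes from the description; the function |
| name <i>handshake_demo</i>, the socket file name and the text message format |
| are assumptions made for illustration and do not reflect SLURM's actual |
| wire format.</p> |

```python
import os
import socket
import tempfile
import threading


def handshake_demo(job_id, step_id, nodes):
    """Send step details over a Unix domain socket and return what arrived."""
    received = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "srun_cr.sock")  # hypothetical socket name

        # srun_cr role: create the socket and listen on it.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(1)

        def accept_one():
            conn, _ = listener.accept()
            chunks = []
            with conn:
                while True:
                    chunk = conn.recv(4096)
                    if not chunk:
                        break
                    chunks.append(chunk)
            # Hypothetical message format: "job_id step_id nodes" as text.
            job, step, nodelist = b"".join(chunks).decode().split(" ", 2)
            received.update(job_id=int(job), step_id=int(step), nodes=nodelist)

        worker = threading.Thread(target=accept_one)
        worker.start()

        # srun role: read the socket address from the environment and report.
        env = dict(os.environ, SLURM_SRUN_CR_SOCKET=path)
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(env["SLURM_SRUN_CR_SOCKET"])
        client.sendall("{} {} {}".format(job_id, step_id, nodes).encode())
        client.close()

        worker.join()
        listener.close()
    return received
```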
| |
| <p>Upon checkpoint, <b>srun_cr</b> checks to see if the tasks have been launched. |
| If so, <b>srun_cr</b> first forwards the checkpoint request to the tasks by |
| calling the SLURM API <b>slurm_checkpoint_tasks()</b> before dumping its process |
| context.</p> |
| |
| <p>Upon restart, <b>srun_cr</b> checks to see if the tasks have been previously |
| launched and checkpointed. |
| If true, the environment variable <b>SLURM_RESTART_DIR</b> is set to the |
| directory of the checkpoint image files of the tasks. |
| Then <b>srun</b> is forked and executed again. |
| The environment variable will be used by the <b>srun</b> command to restart |
| execution of the tasks from the previous checkpoint.</p> |
| |
| <h3>sbatch</h3> |
| |
| <p>Several options have been added to support checkpoint/restart:</p> |
| <ul> |
| <li><b>--checkpoint</b>: Specifies the interval between periodic checkpoints |
| of a batch job. |
| By default, the job will have no checkpoints created. |
| Acceptable time formats include "minutes", "minutes:seconds", |
| "hours:minutes:seconds", "days-hours", "days-hours:minutes" and |
| "days-hours:minutes:seconds".</li> |
| <li><b>--checkpoint-dir</b>: Specifies the directory where the checkpoint image |
| files of a batch job will be stored. |
| The default value is the current working directory. |
| Checkpoint files will be of the form <i>"&lt;job_id&gt;.ckpt"</i> for jobs |
| and <i>"&lt;job_id&gt;.&lt;step_id&gt;.ckpt"</i> for job steps.</li> |
| </ul> |
| |
| <p>Environment variables are available for all of these options:</p> |
| <ul> |
| <li><b>SLURM_CHECKPOINT</b> is equivalent to <b>--checkpoint</b></li> |
| <li><b>SLURM_CHECKPOINT_DIR</b> is equivalent to <b>--checkpoint-dir</b></li> |
| </ul> |
| |
| <h3>scontrol</h3> |
| |
| <p><b>scontrol</b> is used to initiate checkpoint/restart requests.</p> |
| <ul> |
| <li><b>scontrol checkpoint create <i>jobid</i> [ImageDir=<i>dir</i>] |
| [MaxWait=<i>seconds</i>]</b><br> |
| Requests a checkpoint on a specific job. |
| For backward compatibility, if a job id is specified, all of its job steps |
| are checkpointed. |
| If a batch job id is specified, the entire job is checkpointed including |
| the batch shell and all running tasks of all job steps. |
| Upon checkpoint, the task launch command must forward the requests to |
| tasks it launched. |
| <ul> |
| <li><b>ImageDir</b> specifies the directory in which to save the checkpoint |
| image files. If specified, this takes precedence over any <b>--checkpoint-dir</b> |
| option specified when the job or job step was submitted.</li> |
| <li><b>MaxWait</b> specifies the maximum time, in seconds, permitted for a |
| checkpoint request to complete. The request will be considered failed if not |
| completed in this time period.</li> |
| </ul></li> |
| |
| <li><b>scontrol checkpoint create <i>jobid.stepid</i> [ImageDir=<i>dir</i>] |
| [MaxWait=<i>seconds</i>]</b><br> |
| Requests a checkpoint on a specific job step.</li> |
| |
| <li><b>scontrol checkpoint restart <i>jobid</i> [ImageDir=<i>dir</i>] |
| [StickToNodes]</b><br> |
| Restart a previously checkpointed batch job. |
| <ul> |
| <li><b>ImageDir</b> specifies the directory from which to read the checkpoint |
| image files.</li> |
| <li><b>StickToNodes</b> specifies that the job should be restarted on the |
| same set of nodes from which it was previously checkpointed.</li> |
| </ul></li> |
| </ul> |
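| <p>For scripted use, the request forms listed above can be assembled as |
| argument lists. The following Python sketch only builds the lists and does not |
| run <b>scontrol</b>; the function names <i>checkpoint_create_argv</i> and |
| <i>checkpoint_restart_argv</i> are hypothetical helpers, not part of SLURM.</p> |

```python
def checkpoint_create_argv(target, image_dir=None, max_wait=None):
    """Build "scontrol checkpoint create" arguments.

    target is a job id ("1234") or a job step id ("1234.0").
    """
    argv = ["scontrol", "checkpoint", "create", str(target)]
    if image_dir is not None:
        argv.append("ImageDir=%s" % image_dir)
    if max_wait is not None:
        # MaxWait is given in seconds.
        argv.append("MaxWait=%d" % int(max_wait))
    return argv


def checkpoint_restart_argv(job_id, image_dir=None, stick_to_nodes=False):
    """Build "scontrol checkpoint restart" arguments for a batch job."""
    argv = ["scontrol", "checkpoint", "restart", str(job_id)]
    if image_dir is not None:
        argv.append("ImageDir=%s" % image_dir)
    if stick_to_nodes:
        # Ask for the same set of nodes the job was checkpointed on.
        argv.append("StickToNodes")
    return argv
```

| <p>On a host where SLURM is installed, a returned list could be passed to |
| Python's <i>subprocess.run()</i> to issue the request.</p> |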
| |
| <h2>Configuration</h2> |
| |
| <p>The following SLURM configuration parameter has been added:</p> |
| <ul> |
| <li><b>JobCheckpointDir</b>: |
| Specifies the default directory for storing or reading job checkpoint |
| information. The data stored here is only a few thousand bytes per job |
| and includes information needed to resubmit the job request, not the job's |
| memory image. The directory must be readable and writable by |
| <b>SlurmUser</b>, but not writable by regular users. The job memory images |
| may be in a different location as specified by the <b>--checkpoint-dir</b> |
| option at job submit time or scontrol's <b>ImageDir</b> option.</li> |
| </ul> |
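| <p>The permission requirement on <b>JobCheckpointDir</b> can be checked with a |
| short Python sketch such as the one below. It assumes that the directory is |
| owned by <b>SlurmUser</b> and treats any group or world write bit as "writable |
| by regular users", which may be stricter than a given site needs; the function |
| name <i>checkpoint_dir_mode_ok</i> is hypothetical.</p> |

```python
import os
import stat


def checkpoint_dir_mode_ok(path):
    """Return True if path is a directory with owner read/write access
    and no group or world write permission."""
    mode = os.stat(path).st_mode
    owner_rw = stat.S_IRUSR | stat.S_IWUSR
    others_write = stat.S_IWGRP | stat.S_IWOTH
    return (
        stat.S_ISDIR(mode)
        and (mode & owner_rw) == owner_rw
        and (mode & others_write) == 0
    )
```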
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p style="text-align:center;">Last modified 26 January 2010</p> |
| |
| <!--#include virtual="footer.txt"--> |