| <!--#include virtual="header.txt"--> |
| |
| <h1>IBM AIX User and Administrator Guide</h1> |
| |
| <h2>Overview</h2> |
| |
| <p>This document describes the unique features of SLURM on the |
| IBM AIX computers with a Federation switch. |
| You should be familiar with the SLURM's mode of operation on Linux clusters |
| before studying the relatively few differences in IBM system operation |
| described in this document.</p> |
| |
| <h2>User Tools</h2> |
| |
| <p>The normal set of SLURM user tools: srun, scancel, sinfo, smap, squeue and scontrol |
| provide all of the expected services except support for job steps. |
| While the srun command will launch the tasks of a job step on an IBM |
| AIX system, it does not support use of the Federation switch or IBM's MPI. |
| Job steps should be launched using IBM's poe command. |
| This architecture insures proper operation of all IBM tools.</p> |
| |
| <p>You will use srun to submit a batch script to SLURM. |
| This script should contain one or more invocations of poe to launch |
| the tasks. |
| If you want to run a job interactively, just execute poe directly. |
| Poe will recognize that it lacks a SLURM job allocation (the SLURM_JOBID |
| environment variable will be missing) and create the SLURM allocation |
| prior to launching tasks.</p> |
| |
| <p>Each poe invocation (or SLURM job step) can have it's own network |
| specification. |
| For example one poe may use IP mode communications and the next use |
| User Space (US) mode communcations. |
| This enhancement to normal poe functionality may be accomplished by |
| setting the SLURM_NETWORK environment variable. |
| The format of SLURM_NETWORK is "network.[protocol],[type],[usage],[mode]". |
| For example "network.mpi,en0,shared,ip". |
| See LoadLeveler documentation for more details.</p> |
| |
| <h2>Checkpoint</h2> |
| |
| <p>SLURM supports checkpoint via poe. |
| In order to enable checkpoint, the shell executing the poe command must |
| itself be initiated with the environment variable <b>CHECKPOINT=yes</b>. |
| One file is written for each node on which the job is executing, plus |
| another for the script executing poe.a |
| By default, the checkpoint files will be written to the current working |
| directory of the job. |
| Names and locations of these files can be controled using the |
| environment variables <b>MP_CKPTFILE</b> and <b>MP_CKPTDIR</b>. |
| Use the squeue command to identify the job and job step of interest. |
| To initiate a checkpoint in which the job step will continue execution, |
| use the command: <br> |
| <b>scontrol check create <i>job_id.step_id</i></b><br> |
| To initiate a checkpoint in which the job step will terminate afterwards, |
| use the command: <br> |
| <b>scontrol check vacate <i>job_id.step_id</i></b></p> |
| |
| <h2>System Administration</h2> |
| |
| <p>Three unique components are required to use SLURM on an IBM system.</p> |
| <ol> |
| <li>The Federation switch plugin is required. |
| This component is packaged with the SLURM distrbution.</li> |
| <li>There is a process tracking kernel extension required. |
| This is used to insure that all processes associated with a job |
| are tracked. |
| SLURM normatlly uses session ID and process group ID on Linux systems, |
| but these mechanisms can not prevent user processes from establishing |
| their own session or process group and thus "escape" from SLURM |
| tracking. |
| This kernel extension is not packaged with SLURM, but is available |
| upon request.</li> |
| <li>The final component is a library that accepts poe library calls |
| and performs actions in SLURM to satisfy these requests, such |
| as launching tasks. |
| This library is based upon IBM Confidential information and is |
| not at this time available for distribution. |
| Interested parties are welcome to pursue the possible distribution |
| of this library with IBM and SLURM developers.</li> |
| </ol> |
| <p>Until this last issue is resolved, use of SLURM on an IBM AIX system |
| should not be viewed as a supported configuration (at least outside |
| of LLNL, which established a contract with IBM for this purpose).</p> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p style="text-align:center;">Last modified 31 August 2005</p></td> |
| |
| <!--#include virtual="footer.txt"--> |