<!--#include virtual="header.txt"-->
<h1>IBM Parallel Environment User and Administrator Guide</h1>
<p>
<a href="#overview">Overview</a><br>
<a href="#user">User Tools</a><br>
<a href="#admin">System Administration</a></p>
<h2><a name="overview">Overview</a></h2>
<p>This document describes the unique features of SLURM on the
IBM computers with the
<a href="http://www-03.ibm.com/systems/software/parallel/">Parallel Environment (PE)</a>
software. You should be familiar with SLURM's mode of operation on Linux
clusters before studying the relatively few differences in operation on systems
with PE, which are described in this document.</p>
<p>Note that Slurm is designed to be a replacement for IBM's LoadLeveler;
the two are not designed to concurrently schedule resources.
Slurm manages network resources and provides the POE command with
a library that emulates LoadLeveler functionality.</p>
<h2><a name="user">User Tools</a></h2>
<p>The normal set of SLURM user tools (srun, scancel, sinfo, squeue, scontrol,
etc.) provides all of the expected services.
The only SLURM command not supported is sattach.
Job steps are launched using the
srun command, which translates its options and invokes IBM's poe command. The
poe command actually launches the tasks. The poe command may also be invoked
directly if desired. The actual task launch process is as follows:</p>
<ol>
<li>Invoke srun command with desired options.</li>
<li>The srun command creates a job allocation (if necessary).</li>
<li>The srun command translates its options and invokes the poe command.</li>
<li>The poe command loads a SLURM library that provides various resource
management functionality.</li>
<li>The poe command, through the SLURM library, creates a SLURM step
allocation and launches a process named
"pmdv12" on the appropriate compute nodes. Note that the "v12" on the end of
the process name represents the version number of the "pmd" process and is
subject to change.</li>
<li>The poe command interacts with the pmdv12 process to launch the application
tasks, handle their I/O, etc. Since the task launch procedure occurs outside of
SLURM's control, none of the normal task-level SLURM support is available.</li>
<li>The poe command, through the SLURM library, reports the completion of the
job step.</li>
</ol>
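<p>For example, assuming an application named "my_app" (illustrative only),
a job step could be launched with a command like the one below; srun creates
the job allocation if necessary and then invokes poe to launch the tasks:</p>
<pre>
$ srun -N2 -n4 ./my_app
</pre>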
<h3>Network Options</h3>
<p>Each job step can specify its desired network options.
For example, one job step may use IP mode communications and the next use
User Space (US) mode communications.
Network options may be specified using srun's --network option or the
SLURM_NETWORK environment variable. Supported network options include:</p>
<ul>
<li>Network protocol</li>
<ul>
<li><b>ip</b> Internet protocol, version 4</li>
<li><b>ipv4</b> Internet protocol, version 4 (default)</li>
<li><b>ipv6</b> Internet protocol, version 6</li>
<li><b>us</b> User Space protocol, may be combined with ipv4 or ipv6</li>
</ul>
<li>Programming interface</li>
<ul>
<li><b>lapi</b> Low-level Application Programming Interface</li>
<li><b>mpi</b> Message Passing Interface (default)</li>
<li><b>pami</b> Parallel Active Message Interface</li>
<li><b>shmem</b> OpenSHMEM interface</li>
<li><b>upc</b> Unified Parallel C Interface</li>
</ul>
<li>Other options</li>
<ul>
<li><b>bulk_xfer [=<i>resources</i>]</b>
Enable bulk transfer of data using Remote Direct-Memory Access (RDMA).
The optional <i>resources</i> specification is a numeric value which can have
a suffix of "k", "K", "m", "M", "g" or "G" for kilobytes, megabytes or
gigabytes.
<b>NOTE:</b> The <i>resources</i> specification is not supported by the
underlying IBM infrastructure as of Parallel Environment version 2.2 and no
value should be specified at this time.</li>
<li><b>cau=<i>count</i></b>
Specify the count of Collective Acceleration Units (CAU) required per
programming interface.
Default value is zero.
Applies only to IBM Power7-IH processors.
POE requires that if <b>cau</b> has a non-zero value then <b>us</b>,
<b>devtype=IB</b> or <b>devtype=HFI</b> must be explicitly specified; otherwise
the request may attempt to allocate CAU with IP communications and fail.</li>
<li><b>devname=<i>name</i></b>
Specify the name of an individual network adapter to use.
For example: "eth0" or "mlx4_0".</li>
<li><b>devtype=<i>type</i></b>
Specify the device type to use for communications.
The supported values of <i>type</i> are:
"IB" (InfiniBand), "HFI" (P7 Host Fabric Interface),
"IPONLY" (IP-Only interfaces), "HPCE" (HPC Ethernet), and
"KMUX" (Kernel Emulation of HPCE).
The devices allocated to a job must all be of the same type.
The default value depends upon what hardware is available and, in order
of preference, is IPONLY (which is not considered in User Space mode),
HFI, IB, HPCE, and KMUX.</li>
<li><b>immed=<i>count</i></b>
Specify the count of immediate send slots per adapter window.
Default value is zero.
Applies only to IBM Power7-IH processors.</li>
<li><b>instances=<i>count</i></b>
Specify the number of network connections for each task on each network.
The default instance count is 1.</li>
<li><b>sn_all</b> Use all available switch adapters (default).
This option can not be combined with sn_single.</li>
<li><b>sn_single</b> Use only one switch adapter.
This option can not be combined with sn_all.
If multiple adapters of different types exist, the devname and/or
devtype option can also be used to select one of them.</li>
</ul>
</ul>
<p>Examples of network option use:
<br><br>
<b>--network=sn_all,mpi</b><br>
Allocate one switch window per task on each node and every network supporting
MPI.
<br><br>
<b>--network=sn_all,mpi,bulk_xfer,us</b><br>
Allocate one switch window per task on each node and every network supporting
MPI and user space communications. Reserve resources for RDMA.
<br><br>
<b>--network=sn_all,instances=3,mpi</b><br>
Allocate three switch windows per task on each node and every network supporting
MPI.
<br><br>
<b>--network=sn_all,mpi,pami</b><br>
Allocate one switch window per task on each node and every network supporting
MPI and a second window supporting PAMI.
<br><br>
<b>--network=devtype=ib,instances=2,lapi,mpi</b><br>
On every InfiniBand network connection, allocate two switch windows each for
both lapi and mpi interfaces. If each node has one InfiniBand network connection,
this would result in four switch windows per task.
</p>
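<p>As a complete command line sketch (the application name and the node and
task counts are illustrative), the network options above may be given either
directly on the srun command line or through the SLURM_NETWORK environment
variable:</p>
<pre>
$ srun -N2 -n8 --network=sn_all,us,mpi,bulk_xfer ./my_app

$ export SLURM_NETWORK=sn_all,us,mpi,bulk_xfer
$ srun -N2 -n8 ./my_app
</pre>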
<p><b>NOTE:</b> Switch resources on a node are shared between all job steps on
that node. If a job step can not be initiated due to insufficient switch
resources being available, that job step will periodically retry allocating
resources for the lifetime of the job unless srun's --immediate option is
used.</p>
<h3>Debugging</h3>
<p>Most debuggers require detailed information about launched tasks such as
host name, process ID, etc. Since that information is only available from
poe (which launches those tasks), the srun command wrapper can not be used
for most debugging purposes. You or the debugging tool must invoke the poe
command directly. In order to facilitate the direct use of poe, srun's
<b>--launch-cmd</b> option may be used together with the options normally used.
srun will then print the equivalent poe command line, which
can subsequently be used with the debugger. The poe options must be explicitly
set even if the command is executed from within an existing SLURM allocation
(i.e. from within an allocation created by the salloc or sbatch command).</p>
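<p>For example, a command of the form shown below (the application name and
options are illustrative) will print the equivalent poe command line; that
command line can then be used with the debugger:</p>
<pre>
$ srun -N2 -n4 --network=us,mpi --launch-cmd ./my_app
</pre>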
<h3>Checkpoint</h3>
<p>Checkpoint/restart is only supported with LoadLeveler.
The <b>checkpoint/poe</b> plugin is based on SLURM's support for checkpointing
of poe in the 2005 time frame and does not work with the current versions
of poe.</p>
<!--
<p>In order to enable checkpoint, the shell executing the poe command must
itself be initiated with the environment variable <b>CHECKPOINT=yes</b>.
One file is written for each node on which the job is executing, plus
another for the script executing poe.
By default, the checkpoint files will be written to the current working
directory of the job.
Names and locations of these files can be controlled using the
environment variables <b>MP_CKPTFILE</b> and <b>MP_CKPTDIR</b>.
Use the squeue command to identify the job and job step of interest.
To initiate a checkpoint in which the job step will continue execution,
use the command: <br>
<b>scontrol check create <i>job_id.step_id</i></b><br>
To initiate a checkpoint in which the job step will terminate afterwards,
use the command: <br>
<b>scontrol check vacate <i>job_id.step_id</i></b></p>
-->
<h3>Unsupported Options</h3>
<p>Some SLURM options can not be supported by PE and the following srun options
are silently ignored:</p>
<ul>
<li>-D, --chdir (set working directory)</li>
<li>-K, --kill-on-bad-exit (terminate step if any task has a non-zero exit code)</li>
<li>-k, --no-kill (set to not kill job upon node failure)</li>
<li>--ntasks-per-core (number of tasks to invoke per core)</li>
<li>--ntasks-per-socket (number of tasks to invoke per socket)</li>
<li>-O, --overcommit (over subscribe resources)</li>
<li>--resv-ports (communication ports reserved for OpenMPI)</li>
<li>--runjob-opts (used only on IBM BlueGene/Q systems)</li>
<li>--signal (signal to send when near time limit and the remaining time required)</li>
<li>--sockets-per-node (number of sockets per node required)</li>
<li>--task-epilog (per-task epilog program)</li>
<li>--task-prolog (per-task prolog program)</li>
<li>-u, --unbuffered (avoid line buffering)</li>
<li>-W, --wait (specify job wait time after first task exit)</li>
<li>-Z, --no-allocate (launch tasks without creating a job allocation)</li>
</ul>
<p>A limited subset of srun's --cpu-bind options are supported as shown
below. If the --cpus-per-task option is not specified, a value of one is used
by default. Note that SLURM's mask_cpu and map_cpu options are not supported,
nor are options to bind to sockets or boards.</p>
<table border=1 cellspacing=0 cellpadding=4>
<tr>
<td><b>SLURM option</b></td>
<td><b>POE equivalent</b></td>
</tr>
<tr>
<td>--cpu-bind=threads --cpus-per-task=#</td>
<td>-task_affinity=cpu:#</td>
</tr>
<tr>
<td>--cpu-bind=cores --cpus-per-task=#</td>
<td>-task_affinity=core:#</td>
</tr>
<tr>
<td>--cpu-bind=rank --cpus-per-task=#</td>
<td>-task_affinity=cpu:#</td>
</tr>
</table>
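<p>For example, the srun command below (task count and binary name are
illustrative) corresponds to the POE task affinity option shown in the
comment:</p>
<pre>
$ srun --cpu-bind=cores --cpus-per-task=2 -n8 ./my_app
# POE equivalent task affinity option: -task_affinity=core:2
</pre>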
<p>In addition, file name specifications with expression substitution
(e.g. file names including "%j" for job_ID, "%J" for job_ID.step_ID,
"%s" for step_ID, "%t" for task_ID, or "%n" for node_ID) are not supported.
This affects the following options:</p>
<ul>
<li>-e, --error</li>
<li>-i, --input</li>
<li>-o, --output</li>
</ul>
<p>For the srun command's --multi-prog option (Multiple Program,
Multiple Data configurations), the command file will be translated from
SLURM's format to a POE format. POE does not support SLURM expressions
in the MPMD configuration file (e.g. "%t" will not be replaced with the task's
number and "%o" will not be replaced with the task's offset within
this range). The command file will be stored in a temporary file
in the ".slurm" subdirectory of your home directory. The file name will have a
prefix of "slurm_cmdfile." followed by srun's process id. If the srun command
does not terminate gracefully, this file may persist after the job step's
termination and not be purged. You can find and purge these files using
commands of the form shown below. You should only purge older files, after
their job steps are completed.</p>
<pre>
$ ls -l ~/.slurm/slurm_cmdfile.*
$ rm ~/.slurm/slurm_cmdfile.*
</pre>
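<p>For reference, a SLURM MPMD command file has the form sketched below (the
rank ranges and program names are illustrative); it would be used as
<b>srun -n4 --multi-prog my.conf</b>, and the translated POE version of this
file is what gets written to the ".slurm" subdirectory described above:</p>
<pre>
# my.conf: SLURM --multi-prog command file
0    ./master
1-3  ./worker
</pre>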
<p>The -L/--label option differs slightly in that when the output from multiple
tasks is identical, it is combined on a single line with a prefix
identifying which task(s) generated the output. In addition, there is a colon
but no space between the task IDs and output. For example:</p>
<pre>
# SLURM OUTPUT
0: foo
1: foo
2: foo
0: bar
1: barr
2: bar
# POE OUTPUT
0-2:foo
0 2:bar
1:barr
</pre>
<p>In addition, when srun's --multi-prog option (for Multiple Program,
Multiple Data configurations) is used with the -L/--label option, a job
step ID, colon and space will precede the task ID and colon. For example:</p>
<pre>
# SLURM OUTPUT
0: zero
1: one
2: two
# POE OUTPUT (FOR STEP ID 1)
1: 0: zero
1: 1: one
1: 2: two
</pre>
<p>The srun command is not able to report task status upon receipt of a SIGINT
signal (ctrl-c interrupt from the keyboard); however, two SIGINT signals within a
one-second interval will terminate the job, as on other SLURM configurations.</p>
<h3>Environment Variables</h3>
<p>Since SLURM is not directly launching user tasks, the following environment
variables are NOT available with POE:</p>
<ul>
<li>SLURM_CPU_BIND_LIST</li>
<li>SLURM_CPU_BIND_TYPE</li>
<li>SLURM_CPU_BIND_VERBOSE</li>
<li>SLURM_CPUS_ON_NODE</li>
<li>SLURM_GTIDS</li>
<li>SLURM_LAUNCH_NODE_IPADDR</li>
<li>SLURM_LOCALID</li>
<li>SLURM_MEM_BIND_LIST</li>
<li>SLURM_MEM_BIND_TYPE</li>
<li>SLURM_MEM_BIND_VERBOSE</li>
<li>SLURM_NODEID</li>
<li>SLURM_PROCID</li>
<li>SLURM_SRUN_COMM_HOST</li>
<li>SLURM_SRUN_COMM_PORT</li>
<li>SLURM_TASK_PID</li>
<li>SLURM_TASKS_PER_NODE</li>
<li>SLURM_TOPOLOGY_ADDR</li>
<li>SLURM_TOPOLOGY_ADDR_PATTERN</li>
</ul>
<p>Note that POE sets a variety of environment variables that provide similar
information to some of the missing SLURM environment variables. In particular,
note the following environment variables:</p>
<ul>
<li>MP_I_UPMD_HOSTNAME (local hostname)</li>
<li>MP_CHILD (global task ID)</li>
</ul>
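<p>For example, a script or wrapper launched by poe could use MP_CHILD where
it would otherwise use SLURM_PROCID (a minimal sketch; the test performed and
the application name are illustrative):</p>
<pre>
#!/bin/sh
# Launched by poe; MP_CHILD holds the global task ID
if [ "$MP_CHILD" -eq 0 ]; then
    echo "task 0 is running on $(hostname)"
fi
exec ./my_app
</pre>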
<h3>Gang Scheduling</h3>
<p>SLURM can be configured to gang schedule (time slice) parallel jobs by
alternately suspending and resuming them. Depending upon the number of
jobs configured to time slice and the time slice interval (as specified in
the <i>slurm.conf</i> file using the <b>Shared</b> and <b>SchedulerTimeSlice</b>
options), the job may experience communication timeouts. Set the environment
variable <b>MP_TIMEOUT</b> to specify an appropriate communication timeout
value. Note that the default timeout is 150 seconds. See
<a href="gang_scheduling.html">Gang Scheduling</a> for more information.</p>
<pre>
export MP_TIMEOUT=600
</pre>
<h3>Other User Notes</h3>
<p>POE can not support a step ID of zero. In POE installations, a job's
first step ID will be 1 rather than 0.</p>
<p>Since the SLURM step launches the PE PMD process instead of the
user's tasks, the exit code stored in accounting will be that of the PMD
rather than of the user's tasks. The exit code of a job allocation
started with srun will be correct, since srun captures the exit code from
the wrapped poe command.</p>
<h2><a name="admin">System Administration</a></h2>
<p>There are several critical SLURM configuration parameters for use with PE.
These configuration parameters should be set in your <b>slurm.conf</b> file.
<b>LaunchType</b> defines the task launch mechanism to be used and must be set
to <b>launch/poe</b>. This configuration means that poe will be used to launch
all applications.
<b>SwitchType</b> defines the mechanism used to manage the network switch. It
must be set to <b>switch/nrt</b>, which uses IBM's Network Resource Table (NRT)
interface, on ALL nodes in the cluster (only a few of them will actually interact
with IBM's NRT library, but all need to work with the NRT data structures).
Task launch is slower in this environment than with a typical Linux cluster
and the <b>MessageTimeout</b> must be configured to a sufficiently large value
so that large parallel jobs can be launched without SLURM's job step
credentials expiring.
When switch resources are allocated to a job, all processes spawned by that job
must be terminated before the switch resources can be released for use by another
program. This means that reliable tracking of all spawned processes is critical
for switch use. Use of <b>ProctrackType=proctrack/cgroup</b> is strongly
recommended. Use of any other process tracking plugin significantly increases
the likelihood of orphan processes that must be manually identified and killed
in order to release switch resources.
While it is possible to configure distinct <b>NodeName</b> and
<b>NodeHostName</b> parameters for the compute nodes, this is discouraged
for performance reasons (the <b>switch/nrt</b> plugin is not optimized for
such a configuration).</p>
<pre>
# Excerpt of slurm.conf
LaunchType=launch/poe
SwitchType=switch/nrt
MessageTimeout=30
ProctrackType=proctrack/cgroup
</pre>
<p>In order for these plugins to be built, the locations of the POE Resource
Manager header file (permapi.h), the NRT header file (nrt.h) and the NRT library
(libnrt.so) must be identified when SLURM is built.
The header files are needed at build time to get NRT data structures, function
return codes, etc.
The NRT library location is needed to identify where the library is to be
loaded from by Slurm's switch/nrt plugin, but the library itself is only
loaded when needed and only by the slurmd daemon.</p>
<p>SLURM searches for the header files in the /usr/include directory by default.
If the files are not installed there, you can specify a different location using
the <b>--with-nrth=PATH</b> option to the configure program, where "PATH" is
the fully qualified pathname of the parent directory(ies) of the nrt.h and
permapi.h files.
SLURM searches for the libnrt.so file in the /usr/lib and /usr/lib64 directories
by default. If the file is not installed there, you can specify a different
location using the <b>--with-nrtlib=PATH</b> option to the configure program,
where "PATH" is the fully qualified pathname of the parent directory of the
libnrt.so file.
Alternately these values may be specified in your ~/.rpmmacros file.
For example:</p>
<pre>
%_with_nrth "/opt/ibmhpc/pecurrent/base/include"
%_with_libnrt "/opt/ibmhpc/pecurrent/base/intel/lib64"
</pre>
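<p>When building SLURM directly from source rather than from an RPM, the
equivalent configure invocation would look like the sketch below, using the
paths from the example above:</p>
<pre>
$ ./configure --with-nrth=/opt/ibmhpc/pecurrent/base/include \
              --with-nrtlib=/opt/ibmhpc/pecurrent/base/intel/lib64
</pre>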
<p><b>IMPORTANT:</b> The poe command interacts with SLURM by loading a
SLURM library providing a variety of functions for its use. The
library name is <i>"libpermapi.so"</i> and it is installed with the
other SLURM libraries in the subdirectory "lib/slurm". You must
modify the link of /usr/lib64/libpermapi.so to point to the location
of the slurm version of this library.</p>
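<p>A sketch of creating that link, assuming SLURM's libraries were installed
under /usr/lib64/slurm (adjust the paths for your installation):</p>
<pre>
ln -sf /usr/lib64/slurm/libpermapi.so /usr/lib64/libpermapi.so
</pre>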
<p>Modifying the "/etc/poe.limits" file is <b>not</b> enough. The poe
command is loading and using the libpermapi.so library initially
from the /usr/lib64 directory. It later reads the /etc/poe.limits
file and loads the library listed there. In order for poe to work
with SLURM, it needs to use the "libpermapi.so" generated by SLURM
for all of its functions. Until poe is modified to only load the
correct library, it is necessary for /usr/lib64/libpermapi.so to
contain SLURM's library or a link to it.</p>
<p>If you are having problems running on more than 32 nodes, this is
most likely the cause.</p>
<h3>Job Scheduling</h3>
<p>SLURM can be configured to gang schedule (time slice) parallel jobs by
alternately suspending and resuming them. Depending upon the number of
jobs configured to time slice and the time slice interval (as specified in
the <i>slurm.conf</i> file using the <b>Shared</b> and <b>SchedulerTimeSlice</b>
options), the job may experience communication timeouts. Set the environment
variable <b>MP_TIMEOUT</b> to specify an appropriate communication timeout
value. Note that the default timeout is 150 seconds. See
<a href="gang_scheduling.html">Gang Scheduling</a> for more information.</p>
<pre>
export MP_TIMEOUT=600
</pre>
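<p>A minimal slurm.conf sketch of the gang scheduling related parameters
(the partition name, node names and values are illustrative; see the
Gang Scheduling page for the complete set of required parameters):</p>
<pre>
# Excerpt of slurm.conf
SchedulerTimeSlice=120
PartitionName=batch Nodes=tux[9-16] Shared=FORCE:2
</pre>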
<p>SLURM also can support long term job preemption with IBM's Parallel
Environment. Jobs can be explicitly preempted and later resumed using the
<b>scontrol suspend &lt;jobid&gt;</b> and <b>scontrol resume &lt;jobid&gt;</b>
commands. This functionality relies upon NRT functions to suspend/resume
programs and reset MPI timeouts. Note that SLURM supports the preemption only
of whole jobs rather than individual job steps. A suspended job will relinquish
CPU resources, but retain memory and switch window resources. Note that the
long term suspension of jobs with any allocated Collective Acceleration
Units (CAU) is disabled and an error message to that effect will be generated
in response to such a request. In addition, version 1200 or higher of IBM's NRT
API is required to support this functionality.</p>
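<p>For example, to explicitly preempt and later resume a job with ID 1234
(an illustrative job ID):</p>
<pre>
$ scontrol suspend 1234
$ scontrol resume 1234
</pre>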
<h3>Design Notes</h3>
<p>It is necessary for all nodes that can be used for scheduling a single job
to have the same network adapter types and count. For example, if node "tux1"
has two ethernet adapters then the node "tux2" in the same cluster must also
have two ethernet adapters on the same networks or be in a different SLURM
partition so that one job can not be allocated resources on both nodes.
Without this restriction, a job may be allocated adapter resources on one node
and be unable to allocate the corresponding adapter resources on another
node.</p>
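<p>A slurm.conf sketch of the partitioning approach described above (the node
and partition names and adapter counts are illustrative):</p>
<pre>
# Excerpt of slurm.conf
# tux[1-8] have two ethernet adapters, tux[9-16] have one
PartitionName=two_nic Nodes=tux[1-8]
PartitionName=one_nic Nodes=tux[9-16]
</pre>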
<p>It is possible to configure SLURM and LoadLeveler to simultaneously exist
on a cluster, however each scheduler must be configured to manage different
compute nodes (e.g. LoadLeveler can manage compute nodes "tux[1-8]" and SLURM
can manage compute nodes "tux[9-16]" on the same cluster). In addition, the
/etc/poe.limits file on each node must identify the MP_PE_RMLIB appropriate
for that node (e.g. IBM's or SLURM's libpermapi.so).
If Slurm and LoadLeveler are configured to simultaneously manage the same
nodes, you should expect both resource managers to try assigning the same
resources. This will result in job failures.</p>
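<p>A sketch of the corresponding /etc/poe.limits entry on a SLURM-managed node
(the exact syntax and library path should be verified against the IBM PE
documentation for your installation):</p>
<pre>
MP_PE_RMLIB=/usr/lib64/slurm/libpermapi.so
</pre>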
<p>The srun command uses the <b>launch/poe</b> plugin to launch the poe program.
Then poe uses the <b>launch/slurm</b> plugin to launch the "pmd" process on the
compute nodes, so two launch plugins are actually used.</p>
<p>Depending upon job size and network options, allocating and deallocating
switch resources can take multiple seconds per node and the process of launching
applications on multiple nodes is not well parallelized.
This is outside of SLURM's control.</p>
<p>The two figures below show a high-level overview of the switch/nrt and
launch/poe plugins. A typical Slurm installation with IBM PE would make
use of both plugins, but the operation of each is shown independently for
improved clarity. Note that the switch/nrt plugin is needed by not only the
slurmd daemon, but also the slurmctld daemon (for managing switch allocation
data structures) and the srun command (for packing and unpacking switch
allocation information used at task launch time).
In figure 2, note that the libpermapi library issues the job and job step
creation requests. The srun command is an optional front-end for the poe
command and the poe command can be invoked directly by the user if desired.</p>
<img src="ibm_pe_fig1.png" width=600>
<center>
<p>Figure 1: Use of the switch/nrt plugin</p>
</center>
<img src="ibm_pe_fig2.png" width=600>
<center>
<p>Figure 2: Use of the launch/poe plugin</p>
</center>
<h3>Debugging Notes</h3>
<p>It is possible to generate detailed logging of all switch/nrt actions and
data by configuring <b>DebugFlags=switch</b>.</p>
<p>The environment variable <b>MP_INFOLEVEL</b> can be used to enable the
logging of POE debug messages. To enable fairly detailed logging, set
<b>MP_INFOLEVEL=6</b>.</p>
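<p>For example, combining both settings:</p>
<pre>
# In slurm.conf:
DebugFlags=switch
# In the job's environment:
export MP_INFOLEVEL=6
</pre>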
<p>The Protocol Network Services Daemon (PNSD) manages the Network Resource
Table (NRT) information on each node. Its logs are written to the file
<b>/tmp/serverlog</b>, which may be useful to diagnose problems. In order to
execute PNSD in debug mode (for extra debugging information), run the following
commands as user root:</p>
<pre>
stopsrc -s pnsd
startsrc -s pnsd -a -D
</pre>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 14 March 2013</p></td>
<!--#include virtual="footer.txt"-->