<!--#include virtual="header.txt"-->
<h1>SLURM User and Administrator Guide for Cray systems</h1>
<h2>User Guide</h2>
<p>This document describes the unique features of SLURM on Cray computers.
You should be familiar with SLURM's mode of operation on Linux clusters
before studying the differences in Cray system operation described in this
document.</p>
<p>SLURM version 2.3 is designed to operate as a job scheduler over Cray's
Application Level Placement Scheduler (ALPS).
Use SLURM's <i>sbatch</i> or <i>salloc</i> commands to create a resource
allocation in ALPS.
Then use ALPS' <i>aprun</i> command to launch parallel jobs within the resource
allocation.
The resource allocation is released once the batch script or the
<i>salloc</i> command terminates.
Alternatively, there is an <i>aprun</i> wrapper distributed with SLURM in
<i>contribs/cray/srun</i> which will translate <i>srun</i> options
into the equivalent <i>aprun</i> options. This wrapper will also execute
<i>salloc</i> as needed to create a job allocation in which to run the
<i>aprun</i> command. The <i>srun</i> script adds two new options:
<i>--man</i>, which prints a summary of the options, including notes about which
<i>srun</i> options are not supported, and <i>--alps=</i>, which can be used
to specify <i>aprun</i> options which lack an equivalent within <i>srun</i>.
For example, <i>srun --alps="-a xt" -n 4 a.out</i>.
Since <i>aprun</i> is used to launch tasks (the equivalent of a SLURM
job step), the job steps will not be visible using SLURM commands.
Other than SLURM's <i>srun</i> command being replaced by <i>aprun</i>
and the job steps not being visible, all other SLURM commands will operate
as expected. Note that in order to build and install the aprun wrapper
described above, execute "configure" with the <i>--with-srun2aprun</i>
option or add <i>%_with_srun2aprun 1</i> to your <i>~/.rpmmacros</i> file.</p>
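<p>For example, either of the following enables the wrapper build (a sketch; any
other configure options your site needs are omitted here):</p>
<pre>
# when building from source:
./configure --with-srun2aprun [other options]
# or, when building RPMs:
echo "%_with_srun2aprun 1" >> ~/.rpmmacros
</pre>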
<h3>Node naming and node geometry on Cray XT/XE systems</h3>
<p>SLURM node names will be of the form "nid#####" where "#####" is a five-digit sequence number.
Other information available about a node is its XYZ coordinate, in the node's <i>NodeAddr</i>
field, and its component label, in the <i>NodeHostName</i> field.
The format of the component label is "c#-#c#s#n#" where the "#" fields represent in order:
cabinet, row, cage, blade or slot, and node.
For example, "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.</p>
<p>Cray XT/XE systems come with a 3D torus by default. On smaller systems the cabling in the X dimension is
omitted, resulting in a two-dimensional torus (1 x Y x Z). On Gemini/XE systems, pairs of adjacent nodes
(nodes 0/1 and 2/3 on each blade) share one network interface each. This causes the same Y coordinate to
be assigned to those nodes, so that the number of distinct torus coordinates is half the number of total
nodes.</p>
<p>The SLURM <i>smap</i> and <i>sview</i> tools can visualize node torus positions. Clicking on a particular
node shows its <i>NodeAddr</i> field, which is its (X,Y,Z) torus coordinate base-36 encoded as a 3-character
string. For example, a NodeAddr of '07A' corresponds to the coordinates X = 0, Y = 7, Z = 10.
The <i>NodeAddr</i> of a node can also be shown using 'scontrol show node nid#####'.</p>
<p>Please note that the sbatch/salloc options "<i>--geometry</i>" and "<i>--no-rotate</i>" are BlueGene-specific
and have no impact on Cray systems. Topological node placement depends on what Cray makes available via the
ALPS_NIDORDER configuration option (see below).</p>
<h3>Specifying thread depth</h3>
<p>For threaded applications, use the <i>--cpus-per-task</i>/<i>-c</i> parameter of sbatch/salloc to set
the thread depth per node. This corresponds to mppdepth in PBS and to the aprun -d parameter. Please
note that SLURM does not set the OMP_NUM_THREADS environment variable. Hence, if an application spawns
4 threads, an example script would look like</p>
<pre>
#SBATCH --comment="illustrate the use of thread depth and OMP_NUM_THREADS"
#SBATCH --ntasks=3
#SBATCH -c 4
export OMP_NUM_THREADS=4
aprun -n 3 -d $OMP_NUM_THREADS ./my_exe
</pre>
<h3>Specifying number of tasks per node</h3>
<p>SLURM uses the same default as ALPS, assigning each task to a single core/CPU. In order to
make more resources available per task, you can reduce the number of processing elements
per node (<i>aprun -N</i> parameter, <i>mppnppn</i> in PBS) with the
<i>--ntasks-per-node</i> option of <i>sbatch/salloc</i>.
This is particularly necessary when tasks require more memory than the per-CPU default.</p>
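<p>For example (an illustrative sketch assuming 24-core nodes; adjust the task
counts to your system), the following places only two tasks on each node so that
each task has more cores and memory available:</p>
<pre>
#SBATCH --comment="2 tasks per 24-core node, more resources per task"
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
aprun -n 8 -N 2 ./my_exe
</pre>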
<h3>Specifying per-task memory</h3>
<p>In Cray terminology, a task is also called a "processing element" (PE), hence below we
refer to the per-task memory and "per-PE" memory interchangeably. The per-PE memory
requested through the batch system corresponds to the <i>aprun -m</i> parameter.</p>
<p>Due to the implicit default assumption that 1 task runs per core/CPU, the default memory
available per task is the <i>per-CPU share</i> of node_memory / number_of_cores. For
example, on an XT5 system with 16000MB per 12-core node, the per-CPU share is 1333MB.</p>
<p>If nothing else is specified, the <i>--mem</i> option to sbatch/salloc can only be used to
<i>reduce</i> the per-PE memory below the per-CPU share. This is also the only way that
the <i>--mem-per-cpu</i> option can be applied (note that the <i>--mem-per-cpu</i> option
is ignored if the user does not also set --ntasks/-n).
Thus, the preferred way of specifying memory is the more general <i>--mem</i> option.</p>
<p>To <i>increase</i> the per-PE memory settable via the <i>--mem</i> option, you must make
more per-task resources available using the <i>--ntasks-per-node</i> option to sbatch/salloc.
This allows <i>--mem</i> to request up to node_memory / ntasks_per_node megabytes.</p>
<p>When <i>--ntasks-per-node</i> is 1, the entire node memory may be requested by the application.
Setting <i>--ntasks-per-node</i> to the number of cores per node yields the default per-CPU share
minimum value.</p>
<p>For all cases in between these extremes, set --mem=per_task_memory and</p>
<pre>
--ntasks-per-node=floor(node_memory / per_task_memory)
</pre>
<p>whenever per_task_memory needs to be larger than the per-CPU share.</p>
<p><b>Example:</b> An application with 64 tasks needs 7500MB per task on a cluster with 32000MB and 24 cores
per node. Hence ntasks_per_node = floor(32000/7500) = 4.</p>
<pre>
#SBATCH --comment="requesting 7500MB per task on 32000MB/24-core nodes"
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=4
#SBATCH --mem=7500
</pre>
<p>If you would like to fine-tune the memory limit of your application, you can set the same parameters in
a salloc session and then check directly, using</p>
<pre>
apstat -rvv -R $BASIL_RESERVATION_ID
</pre>
<p>to see how much memory has been requested.</p>
<h3>Using aprun -B</h3>
<p>CLE 3.x allows a nice <i>aprun</i> shortcut via the <i>-B</i> option, which
reuses all the batch system parameters (<i>--ntasks, --ntasks-per-node,
--cpus-per-task, --mem</i>) at application launch, as if the corresponding
(<i>-n, -N, -d, -m</i>) parameters had been set; see the aprun(1) manpage
on CLE 3.x systems for details.</p>
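<p>As an illustrative sketch (reusing the memory example above), the following
launch omits the explicit <i>aprun</i> parameters because <i>-B</i> re-reads
them from the batch allocation:</p>
<pre>
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=4
#SBATCH --mem=7500
aprun -B ./my_exe
# on CLE 3.x, equivalent to: aprun -n 64 -N 4 -m 7500 ./my_exe
</pre>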
<h3>Node ordering options</h3>
<p>SLURM honours the node ordering policy set for Cray's Application Level Placement Scheduler (ALPS). Node
ordering is a configurable system option (ALPS_NIDORDER in /etc/sysconfig/alps). The current
setting is reported by '<i>apstat -svv</i>' (look for the line starting with "nid ordering option") and
cannot be changed at runtime. The resulting effective node ordering is revealed by '<i>apstat -no</i>'
(if no special node ordering has been configured, 'apstat -no' shows the
same order as '<i>apstat -n</i>').</p>
<p>SLURM uses exactly the same order as '<i>apstat -no</i>' when selecting
nodes for a job. With the <i>--contiguous</i> option to <i>sbatch/salloc</i>
you can request a contiguous (relative to the current ALPS nid ordering) set
of nodes. Note that on a busy system there is typically more fragmentation,
hence it may take longer (or even prove impossible) to allocate contiguous
sets of a larger size.</p>
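<p>For example, to inspect the configured ordering and the resulting effective
node order on your system:</p>
<pre>
apstat -svv | grep "nid ordering"
apstat -no
</pre>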
<p>Cray/ALPS node ordering is a topic of ongoing work; some information can be found in the CUG-2010 paper
"<i>ALPS, Topology, and Performance</i>" by Carl Albing and Mark Baker.</p>
<h2>Administrator Guide</h2>
<h3>Install supporting RPMs</h3>
<p>The build requires a few -devel RPMs listed below. You can obtain these from
SuSE/Novell.
<ul>
<li>CLE 2.x uses SuSE SLES 10 packages (RPMs may be on the normal ISOs)</li>
<li>CLE 3.x uses SuSE SLES 11 packages (RPMs are on the SDK ISOs; there
are two SDK ISO files)</li>
</ul></p>
<p>You can check which packages are already installed by logging onto the boot node and running</p>
<pre>
boot: # xtopview
default: # rpm -qa
</pre>
<p>The list of packages that should be installed is:</p>
<ul>
<li>expat-2.0.xxx</li>
<li>libexpat-devel-2.0.xxx</li>
<li>cray-MySQL-devel-enterprise-5.0.64 (this should be on the Cray iso)</li>
</ul>
<p>For example, loading MySQL can be done like this:</p>
<pre>
smw: # mkdir mnt
smw: # mount -o loop,ro xe-sles11sp1-trunk.201107070231a03.iso mnt
smw: # find mnt -name cray-MySQL-devel-enterprise\*
mnt/craydist/xt-packages/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm
smw: # scp mnt/craydist/xt-packages/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm root@boot:/rr/current/software
</pre>
<p>Then switch to the boot node and run:</p>
<pre>
boot: # xtopview
default: # rpm -ivh /software/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm
default: # exit
</pre>
<p>Note that all Cray-specific PrgEnv and compiler modules should be removed, and that root
privileges are required to install these files.</p>
<h3>Create a build root</h3>
<p>The build is done on a normal service node, in a directory of your choice
(e.g. <i>/ufs/slurm/build</i> works fine).
Most scripts check for the environment variable LIBROOT.
You can either edit the scripts or export this variable. The easiest way:</p>
<pre>
login: # export LIBROOT=/ufs/slurm/build
login: # mkdir -vp $LIBROOT
login: # cd $LIBROOT
</pre>
<h3>Install SLURM modulefile</h3>
<p>This file is distributed as part of the SLURM tar-ball in
<i>contribs/cray/opt_modulefiles_slurm</i>. Install it as
<i>/opt/modulefiles/slurm</i> (or anywhere else in your module path).
Loading this modulefile makes Munge usable as soon as it is built.</p>
<pre>
login: # scp ~/slurm/contribs/cray/opt_modulefiles_slurm root@boot:/rr/current/software/
</pre>
<h3>Build and install Munge</h3>
<p>Note the Munge installation process on Cray systems differs
somewhat from that described in the
<a href="http://code.google.com/p/munge/wiki/InstallationGuide">
MUNGE Installation Guide</a>.</p>
<p>Munge is the authentication daemon needed by SLURM. Download
munge-0.5.10.tar.bz2 or newer from
<a href="http://code.google.com/p/munge/downloads/list">
http://code.google.com/p/munge/downloads/list</a>. The following shows how
to build and install it on a login node.</p>
<pre>
login: # cd $LIBROOT
login: # cp ~/slurm/contribs/cray/munge_build_script.sh $LIBROOT
login: # mkdir -p ${LIBROOT}/munge/zip
login: # curl -O http://munge.googlecode.com/files/munge-0.5.10.tar.bz2
login: # cp munge-0.5.10.tar.bz2 ${LIBROOT}/munge/zip
login: # chmod u+x ${LIBROOT}/munge_build_script.sh
login: # ${LIBROOT}/munge_build_script.sh
(generates lots of output and creates a tar-ball called
$LIBROOT/munge_build-YYYY-MM-DD.tar.gz)
login: # scp munge_build-2011-07-12.tar.gz root@boot:/rr/current/software
</pre>
<p>Install the tar-ball on the boot node and create an encryption
key file by executing:</p>
<pre>
boot: # xtopview
default: # tar -zxvf /software/munge_build-*.tar.gz -C /
default: # dd if=/dev/urandom bs=1 count=1024 >/opt/slurm/munge/etc/munge.key
default: # chmod go-rxw /opt/slurm/munge/etc/munge.key
default: # exit
</pre>
<h3>Configure Munge</h3>
<p>The following steps apply to each login node and the sdb, where
<ul>
<li>The <i>slurmd</i> or <i>slurmctld</i> daemon will run and/or</li>
<li>Users will be submitting jobs</li>
</ul></p>
<pre>
login: # mkdir --mode=0711 -vp /var/lib/munge
login: # mkdir --mode=0700 -vp /var/log/munge
login: # mkdir --mode=0755 -vp /var/run/munge
login: # module load slurm
</pre>
<pre>
sdb: # mkdir --mode=0711 -vp /var/lib/munge
sdb: # mkdir --mode=0700 -vp /var/log/munge
sdb: # mkdir --mode=0755 -vp /var/run/munge
</pre>
<p>Start the munge daemon and test it.</p>
<pre>
login: # munged --key-file /opt/slurm/munge/etc/munge.key
login: # munge -n
MUNGE:AwQDAAAEy341MRViY+LacxYlz+mchKk5NUAGrYLqKRUvYkrR+MJzHTgzSm1JALqJcunWGDU6k3vpveoDFLD7fLctee5+OoQ4dCeqyK8slfAFvF9DT5pccPg=:
</pre>
<p>When done, verify network connectivity by executing:
<ul>
<li><i>munge -n | ssh other-login-host /opt/slurm/munge/bin/unmunge</i></li>
</ul></p>
<p>If you decide to keep the installation, you may be interested in automating
the process using an <i>init.d</i> script distributed with Munge. This
should be installed on all nodes running Munge, using e.g. 'xtopview -c login' and
'xtopview -n sdbNodeID':
</p>
<pre>
boot: # xtopview -c login
login: # cp /software/etc_init_d_munge /etc/init.d/munge
login: # chmod u+x /etc/init.d/munge
login: # chkconfig munge on
login: # exit
boot: # xtopview -n 31
node/31: # cp /software/etc_init_d_munge /etc/init.d/munge
node/31: # chmod u+x /etc/init.d/munge
node/31: # chkconfig munge on
node/31: # exit
</pre>
<h3>Enable the Cray job service</h3>
<p>This is a common dependency on Cray systems. ALPS relies on the Cray job service to
generate cluster-unique job container IDs (PAGG IDs). These identifiers are used by
ALPS to track running (aprun) job steps. The default (session IDs) is not unique
across multiple login nodes. This standard procedure is described in chapter 9 of
<a href="http://docs.cray.com/books/S-2393-30/">S-2393</a> and takes only two
steps, both to be done on all 'login' class nodes (xtopview -c login):</p>
<ul>
<li>make sure that the /etc/init.d/job service is enabled (chkconfig) and started</li>
<li>enable the pam_job.so module from /opt/cray/job/default in /etc/pam.d/common-session<br/>
(NB: the default pam_job.so is very verbose; a simpler and quieter variant is provided
in contribs/cray.)</li>
</ul>
<p>The latter step is required only if you would like to run interactive
<i>salloc</i> sessions.</p>
<pre>
boot: # xtopview -c login
login: # chkconfig job on
login: # emacs -nw /etc/pam.d/common-session
(uncomment the pam_job.so line)
session optional /opt/cray/job/default/lib64/security/pam_job.so
login: # exit
boot: # xtopview -n 31
node/31:# chkconfig job on
node/31:# emacs -nw /etc/pam.d/common-session
(uncomment the pam_job.so line as shown above)
</pre>
<h3>Build and Configure SLURM</h3>
<p>SLURM can be built and installed as on any other computer, as described in the
<a href="quickstart_admin.html">Quick Start Administrator Guide</a>.
An example of building and installing SLURM version 2.3.0 is shown below.</p>
<p><b>NOTE:</b> By default neither the <i>salloc</i> command nor the <i>srun</i>
command wrapper can be executed as a background process. This is done for two
reasons:</p>
<ol>
<li>Only one ALPS reservation can be created from each session ID. The
<i>salloc</i> command cannot change its session ID without disconnecting
itself from the terminal and its parent process, meaning the process could not
later be put into the foreground or easily identified.</li>
<li>To better identify every process spawned under the <i>salloc</i> process
using terminal foreground process group IDs.</li>
</ol>
<p>You can optionally enable <i>salloc</i> and <i>srun</i> to execute as
background processes by using the configure option
<i>"--enable-salloc-background"</i>; however, doing so will result in failed
resource allocations
(<i>error: Failed to allocate resources: Requested reservation is in use</i>)
if they are not executed sequentially, and will
increase the likelihood of orphaned processes.</p>
<!-- Example:
Modify srun script or ask user to execute "/usr/bin/setsid"
before salloc or srun command -->
<!-- Example:
salloc spawns zsh, zsh spawns bash, etc.
when salloc terminates, bash becomes a child of init -->
<pre>
login: # mkdir build && cd build
login: # slurm/configure \
--prefix=/opt/slurm/2.3.0 \
--with-munge=/opt/slurm/munge/ \
--with-mysql_config=/opt/cray/MySQL/5.0.64-1.0000.2899.20.2.gem/bin \
--with-srun2aprun
login: # make -j
login: # mkdir install
login: # make DESTDIR=/tmp/slurm/build/install install
login: # make DESTDIR=/tmp/slurm/build/install install-contrib
login: # cd install
login: # tar czf slurm_opt.tar.gz opt
login: # scp slurm_opt.tar.gz boot:/rr/current/software
</pre>
<pre>
boot: # xtopview
default: # tar xzf /software/slurm_opt.tar.gz -C /
default: # cd /opt/slurm/
default: # ln -s 2.3.0 default
</pre>
<p>When building SLURM's <i>slurm.conf</i> configuration file, use the
<i>NodeName</i> parameter to specify all batch nodes to be scheduled.
If nodes are defined in ALPS, but not defined in the <i>slurm.conf</i> file, a
complete list of all batch nodes configured in ALPS will be logged by
the <i>slurmctld</i> daemon when it starts.
One would typically use this information to modify the <i>slurm.conf</i> file
and restart the <i>slurmctld</i> daemon.
Note that the <i>NodeAddr</i> and <i>NodeHostName</i> fields should not be
configured, but will be set by SLURM using data from ALPS.
<i>NodeAddr</i> will be set to the node's XYZ coordinate and will be used by SLURM's
<i>smap</i> and <i>sview</i> commands.
<i>NodeHostName</i> will be set to the node's component label.
The format of the component label is "c#-#c#s#n#" where the "#" fields
represent in order: cabinet, row, cage, blade or slot, and node.
For example, "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.</p>
<p>The <i>slurmd</i> daemons will not execute on the compute nodes, but will
execute on one or more front end nodes.
It is from here that batch scripts will execute <i>aprun</i> commands to
launch tasks.
This is specified in the <i>slurm.conf</i> file by using the
<i>FrontendName</i> and optionally the <i>FrontendAddr</i> fields
as seen in the examples below.</p>
<p>Note that SLURM will by default kill running jobs when a node goes DOWN,
while a DOWN node in ALPS only prevents new jobs from being scheduled on the
node. To help avoid confusion, we recommend that <i>SlurmdTimeout</i> in the
<i>slurm.conf</i> file be set to the same value as the <i>suspectend</i>
parameter in ALPS' <i>nodehealth.conf</i> file.</p>
<p>You need to specify the appropriate resource selection plugin (the
<i>SelectType</i> option in SLURM's <i>slurm.conf</i> configuration file).
Configure <i>SelectType</i> to <i>select/cray</i>. The <i>select/cray</i>
plugin provides an interface to ALPS and issues calls to the
<i>select/linear</i> plugin, which selects resources for jobs using a best-fit
algorithm to allocate whole nodes to jobs (rather than individual sockets,
cores or threads).</p>
<p>Note that the system topology is based upon information gathered from
the ALPS database, which reflects the ALPS_NIDORDER configuration in
<i>/etc/sysconfig/alps</i>. Excerpts of a <i>slurm.conf</i> file for
use on Cray systems follow:</p>
<pre>
#---------------------------------------------------------------------
# SLURM USER
#---------------------------------------------------------------------
# SLURM user on cray systems must be root
# This requirement derives from Cray ALPS:
# - ALPS reservations can only be created by the job owner or root
# (confirmation may be done by other non-privileged users)
# - Freeing a reservation always requires root privileges
SlurmUser=root
#---------------------------------------------------------------------
# PLUGINS
#---------------------------------------------------------------------
# Network topology (handled internally by ALPS)
TopologyPlugin=topology/none
# Scheduling
SchedulerType=sched/backfill
# Node selection: use the special-purpose "select/cray" plugin.
# Internally this uses select/linear, i.e. nodes are always allocated
# in units of whole nodes (other allocation is currently not possible, since
# ALPS does not yet allow more than one executable to run on the same
# node, see aprun(1), section LIMITATIONS).
#
# Add CR_memory as parameter to support --mem/--mem-per-cpu.
SelectType=select/cray
SelectTypeParameters=CR_Memory
# Proctrack plugin: only/default option is proctrack/sgi_job
# ALPS requires cluster-unique job container IDs and thus the /etc/init.d/job
# service needs to be started on all slurmd and login nodes, as described in
# S-2393, chapter 9. Due to this requirement, ProctrackType=proctrack/sgi_job
# is the default on Cray and need not be specified explicitly.
#---------------------------------------------------------------------
# PATHS
#---------------------------------------------------------------------
SlurmdSpoolDir=/ufs/slurm/spool
StateSaveLocation=/ufs/slurm/spool/state
# main logfile
SlurmctldLogFile=/ufs/slurm/log/slurmctld.log
# slurmd logfiles (using %h for hostname)
SlurmdLogFile=/ufs/slurm/log/%h.log
# PIDs
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#---------------------------------------------------------------------
# COMPUTE NODES
#---------------------------------------------------------------------
# Return DOWN nodes to service when e.g. slurmd has been unresponsive
ReturnToService=1
# Configure the suspectend parameter in ALPS' nodehealth.conf file to the same
# value as SlurmdTimeout for consistent behavior (e.g. "suspectend: 600")
SlurmdTimeout=600
# Controls how a node's configuration specifications in slurm.conf are
# used.
# 0 - use hardware configuration (must agree with slurm.conf)
# 1 - use slurm.conf, nodes with fewer resources are marked DOWN
# 2 - use slurm.conf, but do not mark nodes down as in (1)
FastSchedule=2
# Per-node configuration for PALU AMD G34 dual-socket "Magny Cours"
# Compute Nodes. We deviate from slurm's idea of a physical socket
# here, since the Magny Cours hosts two NUMA nodes each, which is
# also visible in the ALPS inventory (4 Segments per node, each
# containing 6 'Processors'/Cores).
NodeName=DEFAULT Sockets=4 CoresPerSocket=6 ThreadsPerCore=1
NodeName=DEFAULT RealMemory=32000 State=UNKNOWN
# List the nodes of the compute partition below (service nodes are not
# allowed to appear)
NodeName=nid00[002-013,018-159,162-173,178-189]
# Frontend nodes: these should not be available to user logins, but
# have all filesystems mounted that are also
# available on a login node (/scratch, /home, ...).
FrontendName=palu[7-9]
#---------------------------------------------------------------------
# ENFORCING LIMITS
#---------------------------------------------------------------------
# Enforce the use of associations: {associations, limits, wckeys}
AccountingStorageEnforce=limits
# Do not propagate any resource limits from the user's environment to
# the slurmd
PropagateResourceLimits=NONE
#---------------------------------------------------------------------
# Resource limits for memory allocation:
# * the Def/Max 'PerCPU' and 'PerNode' variants are mutually exclusive;
# * use the 'PerNode' variant for both default and maximum value, since
# - slurm will automatically adjust this value depending on
# --ntasks-per-node
# - if using a higher per-cpu value than possible, salloc will just
# block.
#--------------------------------------------------------------------
# XXX replace both values below with your values from 'xtprocadmin -A'
DefMemPerNode=32000
MaxMemPerNode=32000
#---------------------------------------------------------------------
# PARTITIONS
#---------------------------------------------------------------------
# defaults common to all partitions
PartitionName=DEFAULT Nodes=nid00[002-013,018-159,162-173,178-189]
PartitionName=DEFAULT MaxNodes=178
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60
# "User Support" partition with a higher priority
PartitionName=usup Hidden=YES Priority=10 MaxTime=720 AllowGroups=staff
# normal partition available to all users
PartitionName=day Default=YES Priority=1 MaxTime=01:00:00
</pre>
<p>SLURM supports an optional <i>cray.conf</i> file containing Cray-specific
configuration parameters. <b>This file is NOT needed for production systems</b>,
but is provided for advanced configurations. If used, <i>cray.conf</i> must be
located in the same directory as the <i>slurm.conf</i> file. Configuration
parameters supported by <i>cray.conf</i> are listed below.</p>
<p><dl>
<dt><b>apbasil</b></dt>
<dd>Fully qualified pathname to the apbasil command.
The default value is <i>/usr/bin/apbasil</i>.</dd>
<dt><b>apkill</b></dt>
<dd>Fully qualified pathname to the apkill command.
The default value is <i>/usr/bin/apkill</i>.</dd>
<dt><b>SDBdb</b></dt>
<dd>Name of the ALPS database.
The default value is <i>XTAdmin</i>.</dd>
<dt><b>SDBhost</b></dt>
<dd>Hostname of the database server.
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information and that defaults to 'sdb'.</dd>
<dt><b>SDBpass</b></dt>
<dd>Password used to access the ALPS database.
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information and that defaults to user 'basic'.</dd>
<dt><b>SDBport</b></dt>
<dd>Port used to access the ALPS database.
The default value is 0.</dd>
<dt><b>SDBuser</b></dt>
<dd>Name of user used to access the ALPS database.
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information and that defaults to user 'basic'.</dd>
</dl></p>
<pre>
# Example cray.conf file
apbasil=/opt/alps_simulator_40_r6768/apbasil.sh
SDBhost=localhost
SDBuser=alps_user
SDBdb=XT5istanbul
</pre>
<p>One additional configuration script can be used to ensure that the slurmd
daemons execute with the highest resource limits possible, overriding default
limits on SuSE systems. Depending upon what resource limits are propagated
from the user's environment, lower limits may apply to user jobs, but this
script will ensure that higher limits are possible. Copy the file
<i>contribs/cray/etc_sysconfig_slurm</i> into <i>/etc/sysconfig/slurm</i>
for these limits to take effect. This script is executed from
<i>/etc/init.d/slurm</i>, which is typically executed to start the SLURM
daemons. An excerpt of <i>contribs/cray/etc_sysconfig_slurm</i> is shown
below.</p>
<pre>
#
# /etc/sysconfig/slurm for Cray XT/XE systems
#
# Cray is SuSe-based, which means that ulimits from
# /etc/security/limits.conf will get picked up any time SLURM is
# restarted e.g. via pdsh/ssh. Since SLURM respects configured limits,
# this can mean that for instance batch jobs get killed as a result
# of configuring CPU time limits. Set sane start limits here.
#
# Values were taken from pam-1.1.2 Debian package
ulimit -t unlimited # max amount of CPU time in seconds
ulimit -d unlimited # max size of a process's data segment in KB
</pre>
<p>SLURM's <i>init.d</i> script should also be installed to automatically
start SLURM daemons when nodes boot as shown below. Be sure to edit the script
as appropriate to reference the proper file location (modify the variable
<i>PREFIX</i>).</p>
<pre>
login: # scp /home/crayadm/ben/slurm/etc/init.d.slurm boot:/rr/current/software/
</pre>
<p>SLURM will ignore any interactive jobs and any nodes set to interactive mode,
so set all of your nodes to batch mode from any service node. Dropping the
-n option will set all nodes to batch mode.</p>
<pre>
# xtprocadmin -k m batch -n NODEIDS
</pre>
<p>Now create the needed directories for logs and state files then start the
daemons on the sdb and login nodes as shown below.</p>
<pre>
sdb: # mkdir -p /ufs/slurm/log
sdb: # mkdir -p /ufs/slurm/spool
sdb: # /etc/init.d/slurm start
</pre>
<pre>
login: # /etc/init.d/slurm start
</pre>
<h3>Srun wrapper configuration</h3>
<p>The <i>srun</i> wrapper to <i>aprun</i> might require modification to run
as desired. Specifically, the <i>$aprun</i> variable can be set to the
absolute pathname of that executable file. Without that modification, the
<i>aprun</i> command executed will depend upon the user's search path.</p>
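<p>For example, a minimal sketch of such a modification (the exact line in the
wrapper script may differ):</p>
<pre>
# in the srun wrapper script: use an absolute path instead of relying on $PATH
my $aprun = "/usr/bin/aprun";
</pre>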
<p>In order to debug the <i>srun</i> wrapper, uncomment the line</p>
<pre>
print "comment=$command\n"
</pre>
<p>If the <i>srun</i> wrapper is executed from
within an existing SLURM job allocation (i.e. within <i>salloc</i> or an
<i>sbatch</i> script), then it just executes the <i>aprun</i> command with
appropriate options. If executed without an allocation, the wrapper executes
<i>salloc</i>, which then executes the <i>srun</i> wrapper again. This second
execution of the <i>srun</i> wrapper is required in order to process environment
variables that are set by the <i>salloc</i> command based upon the resource
allocation.</p>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 13 March 2012</p>
<!--#include virtual="footer.txt"-->